Data wrangling with data.table and the Tidyverse
Various ways to accomplish most data wrangling operations with both data.table and the Tidyverse.

Categories: Data Manipulation, Tidyverse, data.table, R
First Published: May 19, 2022

Summary
This post showcases various ways to accomplish most data wrangling operations, from basic filtering/mutating to pivots and non-equi joins, with both data.table and the Tidyverse (dplyr, tidyr, purrr, stringr).
🆕 Expand for Version History

v1: 2022-05-19

v2: 2022-05-26
- Improved the section on keys (for ordering & filtering)
- Added a section for translations of tidyr (and other similar packages)
- Capped tables to display 15 rows max when unfolded
- Improved table display (striping, hiding the contents of nested columns, …)

v3: 2022-07-20
- Updated data.table's examples of dynamic programming using env
- Added new entries in processing examples
- Added new entries to Tidyr & Others: expand + complete, transpose/rotation, …
- Added pivot_wider examples to match the dcast ones in the Pivots section
- Added some new examples across the Basic Operations section
- Added an entry for operating inside nested data.frames/data.tables
- Added a processing example for run-length encoding (i.e. successive event tagging)

v4: 2022-08-05
- Improved the pivot section: example of one-hot encoding (and the reverse operation) + better examples of partial pivots with .value
- Added a tidyr::uncount() (row duplication) example
- Improved both light & dark themes (code highlighting, tables, …)

v5: 2023-03-12
- Revamped the whole document with grouped tabsets by framework for better readability
- Revamped the whole Basic Operations section: better structure, reworked examples, …
- Revamped the whole Joins section: better structure, new examples (e.g. join_by), better explanations, …
- Updated code for recent Tidyverse releases: dplyr (1.1.0: .by, reframe, join_by, consecutive_id, …), purrr (1.0.0: list_rbind, list_cbind, …), tidyr (1.3.0: the newer separate_wider/longer_* replacing separate/separate_rows)
- Updated code for recent data.table updates (1.14.9): let, DT(), …
Unless our task involves repeated subsetting on the same column, the speed gain from key-based subsetting could effectively be nullified by the time needed to reorder the data in RAM, especially for large datasets.
1.1.3 Ordering with (secondary) indices
- setindex creates an index for the provided columns, but doesn't physically reorder the dataset in RAM.
- It stores the ordering vector of the dataset's rows according to the provided columns in an additional attribute called index.

We can thus use indices to pre-compute the ordering for the columns (or combinations of columns) that we will frequently group or subset by!
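A minimal sketch of that workflow (mirroring the post's own MT setup): indices are cheap to create, coexist, and can be listed at any time:

```r
library(data.table)

MT <- as.data.table(mtcars)

# Create a secondary index on cyl + gear: no physical reordering occurs
setindex(MT, cyl, gear)

# Adding a new index doesn't remove a previously existing one
setindex(MT, hp)

indices(MT)  # "cyl__gear" "hp"
```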
To filter by indices, we can use the on argument, which creates a temporary secondary index on the fly (if it doesn’t already exist).
IRIS["setosa", on = "Species"]
data.table [50 x 5]

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
5.1           3.5          1.4           0.2          setosa
4.9           3            1.4           0.2          setosa
4.7           3.2          1.3           0.2          setosa
4.6           3.1          1.5           0.2          setosa
5             3.6          1.4           0.2          setosa
5.4           3.9          1.7           0.4          setosa
4.6           3.4          1.4           0.3          setosa
5             3.4          1.5           0.2          setosa
4.4           2.9          1.4           0.2          setosa
4.9           3.1          1.5           0.1          setosa
5.4           3.7          1.5           0.2          setosa
4.8           3.4          1.6           0.2          setosa
4.8           3            1.4           0.1          setosa
4.3           3            1.1           0.1          setosa
5.8           4            1.2           0.2          setosa
[ omitted 35 entries ]
Since the time to compute the secondary indices is quite small, we don’t have to use setindex, unless the task involves repeated subsetting on the same columns.
Tip
When using on with multiple values, the nomatch = NULL argument avoids creating combinations that do not exist in the original data (i.e. for cyl == 5 here).
MT[.(4:6, 4), on = c("cyl", "gear"), nomatch = NULL]
data.table [12 x 11]

mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
22.8  4    108    93   3.85  2.32   18.61  1   1   4     1
24.4  4    146.7  62   3.69  3.19   20     1   0   4     2
22.8  4    140.8  95   3.92  3.15   22.9   1   0   4     2
32.4  4    78.7   66   4.08  2.2    19.47  1   1   4     1
30.4  4    75.7   52   4.93  1.615  18.52  1   1   4     2
33.9  4    71.1   65   4.22  1.835  19.9   1   1   4     1
27.3  4    79     66   4.08  1.935  18.9   1   1   4     1
21.4  4    121    109  4.11  2.78   18.6   1   1   4     2
21    6    160    110  3.9   2.62   16.46  0   1   4     4
21    6    160    110  3.9   2.875  17.02  0   1   4     4
19.2  6    167.6  123  3.92  3.44   18.3   1   0   4     4
17.8  6    167.6  123  3.92  3.44   18.9   1   0   4     4
1.2.8 Filtering on multiple columns
Filtering with one function taking multiple columns:
f_dat <- \(d) with(d, gear > cyl)  # Function taking the data and comparing fixed columns
f_dyn <- \(x, y) x > y             # Function taking dynamic columns and comparing them
We can't use .SD in the i clause of a data.table, but we can bypass that constraint by doing the operation in two steps:
- Obtaining a logical vector stating whether each row of the table matches the conditions
- Filtering the original table based on that vector
MT[MT[, f_dat(.SD)]]
data.table [2 x 11]

mpg   cyl  disp   hp   drat  wt     qsec  vs  am  gear  carb
26    4    120.3  91   4.43  2.14   16.7  0   1   5     2
30.4  4    95.1   113  3.77  1.513  16.9  1   1   5     2
Combining multiple filtering functions:
This function filters rows that have 2 or more non-zero decimals, and we’re going to call it on multiple columns:
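A sketch of such a function (as defined in the post's source, counting the decimal digits of a value) and a dynamic data.table call over several columns:

```r
library(data.table)
library(stringr)

MT <- as.data.table(mtcars)

# TRUE when a value has 2 or more non-zero decimals
decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) >= 2

cols <- c("drat", "wt", "qsec")

# Reduce the per-column logical vectors with `&` to combine the filters
res <- MT[Reduce(`&`, lapply(mget(cols), decp)), ]
res
```

The Tidyverse equivalent would be `mtcars |> dplyr::filter(dplyr::if_all(cols, decp))`.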
data.table can mutate in 2 ways:
- Using = creates a new DT with the new columns only (like dplyr::transmute)
- Using := (or let) modifies the current DT in place (like dplyr::mutate)
The function modifying a column should return a result of the same size as the original column (or group).
If only one value is provided with :=, it will be recycled to the whole column/group.
If the number of values provided is smaller than the original column/group:
- With := or let, an error will be raised, asking to manually specify how to recycle the values.
- With =, it will behave like dplyr::summarize (if a grouping has been specified).
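A sketch of those recycling rules on a toy table (the column name x is hypothetical):

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b"))

# A single value is recycled over the whole column
ok <- copy(DT)[, x := 1][]

# Fewer values than rows (but more than one): `:=` raises an error
# asking for explicit recycling (e.g. via rep())
err <- try(copy(DT)[, x := c(1, 2)], silent = TRUE)

# With `=` and a grouping, a shorter result summarizes instead: one row per group
summ <- DT[, .(x = 1), by = g]
```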
With data.table, one needs to use the = operator to summarize, providing a function that returns fewer values than the original column (or group) size. By default, only the modified columns are kept (like a transmute).
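For instance, a grouped summarize on the post's MT table (the mean_mpg / n names are illustrative):

```r
library(data.table)

MT <- as.data.table(mtcars)

# `=` inside .() with a length-1 result per group acts as a summarize:
# one row per group, keeping only the listed columns (plus the grouping column)
res <- MT[, .(mean_mpg = mean(mpg), n = .N), by = cyl]
res
```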
1.9.3.3 Apply multiple functions to multiple columns:
Note
Depending on the output we want (i.e. having the function’s output as columns or rows), we can either provide a list of functions to apply (list_of_fns), or a function returning a list (fn_returning_list).
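A sketch of both shapes in data.table (fns, as_cols, and as_rows are illustrative names): a list of functions spreads each function's result into its own column, while a function returning a list stacks the results as rows.

```r
library(data.table)

MT <- as.data.table(mtcars)
cols <- c("mpg", "disp")

# A list of functions: one output column per (function x input column) pair
fns <- list(min = min, max = max)
as_cols <- MT[, unlist(lapply(.SD, \(x) lapply(fns, \(f) f(x))), recursive = FALSE),
              .SDcols = cols]

# A function returning a list: the outputs are stacked as rows instead
as_rows <- MT[, lapply(.SD, \(x) list(min(x), max(x))), .SDcols = cols]
```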
melt(FAM2, measure = measurev(list(value.name = NULL, child = as.integer), pattern = "(.*)_child(\\d)"))
2.2 Dcast / Wider
General idea:
- Pivot around the combination of id.vars (LHS of the formula)
- The measure.vars (RHS of the formula) are the ones whose values become column names
- The value.var columns are the ones whose values are used to fill the new columns
In data.table, not specifying the column holding the measure vars (the names) will result in a single column counting the number of values falling into each combination (i.e. the length() of the result).
dcast(MOVIES_LONG, ... ~ Genre, value.var = "Genre", fun = \(x) !is.na(x), fill = FALSE)
data.table [6 x 5]

ID  OtherCol  action  adventure  animation
1   0.301     TRUE    FALSE      FALSE
2   0.205     FALSE   TRUE       FALSE
2   0.651     TRUE    FALSE      FALSE
3   0.163     TRUE    FALSE      FALSE
3   0.335     FALSE   TRUE       FALSE
3   0.545     FALSE   FALSE      TRUE
3 Joins
Tip
A JOIN is just a special type of SUBSET: we subset the rows of one table based on the matching rows of a second one. And the matching conditions define what type of join we are applying.
3.1 Mutating Joins
The purpose of mutating joins is to add columns/information from one table to another, by matching their rows.
Both left & right joins append the columns of one table to those of another, in the order they are given (i.e. columns of the first table will appear first in the result). However, how rows are matched (and how the ones not finding a match are handled) depends on the type of join:
- Left joins match on the rows of the first (left) table. Unmatched rows from the left table will be kept, but not the right’s.
- Right joins match on the rows of the second (right) table. Unmatched rows from the right table will be kept, but not the left’s.
Example
To find out which country each city belongs to, we’re going to merge countries into cities.
Here, we want to add data to the cities table by matching each city to a country (by their country_id). The ideal output would have the columns of cities first, and keep all rows from cities, even if unmatched: thus we will use a left join.
left_join(cities, countries, by = "country_id", multiple = "all")
data.frame [10 x 4]

city_id  city        country_id  country
1        Barcelona   9           Spain
2        Bergen      8           Norway
3        Bern        10          NA
4        Helsinki    4           Finland
5        Linz        1           Austria
6        Punaauia    6           French Polynesia
7        Queenstown  7           New-Zealand
8        Rouen       5           France
9        Sosua       3           Dominican Republic
10       Trondheim   8           Norway
data.table natively only supports right joins: it filters the rows of the first table by those of the second (FIRST[SECOND]), and only keeps the unmatched rows from the second table.
The normal output of the join
CITIES[COUNTRIES, on = .(country_id)]
data.table [10 x 4]

city_id  city        country_id  country
5        Linz        1           Austria
NA       NA          2           Canada
9        Sosua       3           Dominican Republic
4        Helsinki    4           Finland
8        Rouen       5           France
6        Punaauia    6           French Polynesia
7        Queenstown  7           New-Zealand
2        Bergen      8           Norway
10       Trondheim   8           Norway
1        Barcelona   9           Spain
The unmatched rows from countries were kept, but not the ones from cities. Here are two possible workarounds:
Inverting the two tables (countries first), and then inverting the order of the columns in the result:
COUNTRIES[CITIES, .(city_id, city, country_id, country), on = .(country_id)]
data.table [10 x 4]

city_id  city        country_id  country
1        Barcelona   9           Spain
2        Bergen      8           Norway
3        Bern        10          NA
4        Helsinki    4           Finland
5        Linz        1           Austria
6        Punaauia    6           French Polynesia
7        Queenstown  7           New-Zealand
8        Rouen       5           France
9        Sosua       3           Dominican Republic
10       Trondheim   8           Norway
Adding the columns of countries (in-place) to cities during the join:
copy(CITIES)[COUNTRIES, c("country_id", "country") := list(i.country_id, i.country), on = .(country_id)][]
data.table [10 x 4]

city_id  city        country_id  country
1        Barcelona   9           Spain
2        Bergen      8           Norway
3        Bern        10          NA
4        Helsinki    4           Finland
5        Linz        1           Austria
6        Punaauia    6           French Polynesia
7        Queenstown  7           New-Zealand
8        Rouen       5           France
9        Sosua       3           Dominican Republic
10       Trondheim   8           Norway
We could accomplish a similar result with a right join by inverting the order of appearance of the columns. But the order of the columns in the result will be less ideal (countries first):
HIERA[HIERA, on = .(manager_id = id), nomatch = NULL]
data.table [4 x 7]

id  first_name  last_name  manager_id  i.first_name  i.last_name  i.manager_id
2   Caine       Farrow     1           Maisy         Bloom        NA
3   Waqar       Jarvis     2           Caine         Farrow       1
4   Lacey-Mai   Rahman     2           Caine         Farrow       1
5   Merryn      French     3           Waqar         Jarvis       2
3.2 Filtering Joins
Used to filter one table (left) based on another (right): only the columns from the left table are kept, and rows are either kept (semi join) or discarded (anti join) where IDs match between both tables.
3.2.1 Semi join
Note
Will give the same result as an inner join, but will only keep the columns of the first table (no information is added).
Here, it will filter countries to only keep the countries having a matching country_id in the cities table.
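A sketch of that semi join in both frameworks, using miniature stand-ins for the post's cities/countries tables:

```r
library(data.table)
library(dplyr)

# Miniature stand-ins for the post's tables
cities    <- data.frame(city = c("Linz", "Bern", "Rouen"), country_id = c(1, 10, 5))
countries <- data.frame(country_id = c(1, 2, 5), country = c("Austria", "Canada", "France"))
CITIES    <- as.data.table(cities)
COUNTRIES <- as.data.table(countries)

# Tidyverse
semi <- semi_join(countries, cities, by = "country_id")

# data.table: subset on the (unique) matching IDs, keeping only countries' columns
SEMI <- COUNTRIES[unique(CITIES[, .(country_id)]), on = .(country_id), nomatch = NULL]
```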
Non-equi joins are joins where the condition to match rows is no longer a strict equality between the tables' ID columns.
We can divide non-equi joins between:
- Inequality joins: a general inequality condition between IDs, which could result in multiple matches.
- Rolling joins: only keep the match that minimizes the distance between the IDs (i.e. the closest to perfect equality).
- Overlap joins: matching to all values within a range.
Tip
Please refer to this page of the second edition of R4DS for more detailed explanations.
Data:
Events:
data.table [3 x 4]
e.id
event
e.start
e.end
1
Alice’s graduation
2023-06-05 10:00:00
2023-06-05 13:00:00
2
John’s birthday
2023-06-05 12:00:00
2023-06-05 22:00:00
3
Alice & Mark’s wedding
2023-06-07 13:00:00
2023-06-07 18:00:00
Strikes:
data.table [4 x 4]
s.id
strike_motive
s.start
s.end
1
Not enough cheese
2023-06-05 11:00:00
2023-06-05 20:00:00
2
Not enough wine
2023-06-05 14:00:00
2023-06-05 16:00:00
3
Life’s too expensive
2023-06-08 09:00:00
2023-06-08 20:00:00
4
Our team lost some sport event
2023-07-05 16:00:00
2023-07-05 22:00:00
3.3.1 Inequality join
Inequality joins are joins (left, right, inner, …) that use inequalities (<, <=, >=, or >) to specify the matching criteria.
Warning
The condition has to be a simple inequality between existing columns: it cannot be an arbitrary function (e.g. date.x <= min(date.y) * 2 will not work).
For each event, which strikes occurred (finished) before the event?
EVENTS[STRIKES, on = .(e.start >= s.end), nomatch = NULL]
data.table [2 x 7]

e.id  event                   e.start              e.end                s.id  strike_motive      s.start
3     Alice & Mark’s wedding  2023-06-05 20:00:00  2023-06-07 18:00:00  1     Not enough cheese  2023-06-05 11:00:00
3     Alice & Mark’s wedding  2023-06-05 16:00:00  2023-06-07 18:00:00  2     Not enough wine    2023-06-05 14:00:00
Caution
When specifying an equality or inequality condition, data.table will merge the two columns: only one will remain, with the values of the second column and the name of the first. Here, e.start will have the values of s.end (which will be removed).
I’m not sure if this is a bug or not.
A common use case for inequality joins is avoiding duplicates when generating combinations of items in cross joins:
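A sketch with a hypothetical items table: the strict inequality on the ID prunes both self-pairs and mirrored duplicates, leaving each unordered pair exactly once.

```r
library(data.table)
library(dplyr)

# Toy table of items to pair up
items <- data.frame(id = 1:3, name = c("A", "B", "C"))
IT <- as.data.table(items)

# dplyr: x$/y$ make explicit which side each column comes from
pairs_tv <- inner_join(items, items, join_by(x$id < y$id), suffix = c("", ".2"))

# data.table equivalent (non-equi self-join)
pairs_dt <- IT[IT, on = .(id < id), nomatch = NULL]
```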
Rolling joins are a special type of inequality join where instead of getting every row that satisfies the inequality, we get the one where the IDs are the closest to equality.
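A sketch of data.table's roll argument on toy sales/prices tables (hypothetical names): roll = TRUE carries the last earlier value forward, while roll = "nearest" takes the closest value in either direction.

```r
library(data.table)

SALES  <- data.table(item = c("A", "A"), sale_time = c(4, 9))
PRICES <- data.table(item = "A", price_time = c(1, 5, 8), price = c(10, 12, 15))

# roll = TRUE: for each sale, carry the last earlier price forward (LOCF)
locf <- PRICES[SALES, on = .(item, price_time = sale_time), roll = TRUE]

# roll = "nearest": take the price whose timestamp is closest, before or after
near <- PRICES[SALES, on = .(item, price_time = sale_time), roll = "nearest"]
```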
dplyr provides three helper functions to make it easier to work with intervals:
- between(x, y_min, y_max) <=> x >= y_min, x <= y_max: a value of the first table is within a given range of the second
- within(x_min, x_max, y_min, y_max) <=> x_min >= y_min, x_max <= y_max: the ranges of the first table are contained within the second’s
- overlaps(x_min, x_max, y_min, y_max) <=> x_min <= y_max, x_max >= y_min: the two ranges overlap partially or totally, in any direction
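A sketch of the overlaps() helper, using simplified numeric ranges as stand-ins for the post's timestamped events/strikes tables:

```r
library(dplyr)

# Simplified numeric ranges standing in for the post's timestamps
events  <- data.frame(e.id = 1:2, e.start = c(10, 40), e.end = c(13, 45))
strikes <- data.frame(s.id = 1:2, s.start = c(11, 50), s.end = c(20, 55))

# Which event/strike ranges overlap (partially or totally)?
res <- inner_join(events, strikes,
                  join_by(overlaps(e.start, e.end, s.start, s.end)))
```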
Between: Which events had a strike starting within the two hours following the beginning of the event?
Tip
First, we need to create the new “2 hours after the beginning of the event” column, since we cannot use arbitrary functions in join_by() (e.g. we cannot do between(s.start, e.start, e.start + hours(2))).
By default, the value to match needs to be from the first table, and the range it falls within needs to be from the second table. Depending on the column order we need, this can force us to reorder the columns post-join (as in the above example).
This can be alleviated by manually specifying which table each column comes from, using x$col and y$col (x referring to the first table).
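A sketch combining both points (miniature stand-ins for the events/strikes tables; e.start2 is a hypothetical helper-column name): the helper column is pre-computed, and y$s.start states explicitly that the value to match comes from the second table.

```r
library(dplyr)
library(lubridate)

# Miniature stand-ins for the post's tables
events  <- data.frame(
  e.id    = 1:2,
  e.start = ymd_hms(c("2023-06-05 10:00:00", "2023-06-07 13:00:00"))
)
strikes <- data.frame(
  s.id    = 1:2,
  s.start = ymd_hms(c("2023-06-05 11:00:00", "2023-06-05 14:00:00"))
)

# Pre-compute the "2 hours after the start" bound, then join on it
res <- events |>
  mutate(e.start2 = e.start + hours(2)) |>
  inner_join(strikes, join_by(between(y$s.start, x$e.start, x$e.start2)))
```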
@online{rivière2022,
  author = {Rivière, Marc-Aurèle},
  title = {Data Wrangling with Data.table and the {Tidyverse}},
  date = {2022-05-19},
  url = {https://ma-riviere.com/content/code/posts/data.table},
  langid = {en},
  abstract = {This post showcases various ways to accomplish most data
    wrangling operations, from basic filtering/mutating to pivots and
    non-equi joins, with both `data.table` and the Tidyverse (`dplyr`,
    `tidyr`, `purrr`, `stringr`).}
}
---title: "Data wrangling with data.table and the Tidyverse"subtitle: "Various ways to accomplish most data wrangling operations with both `data.table` and the Tidyverse."date: 2022-05-19abstract: "This post showcases various ways to accomplish most data wrangling operations, from basic filtering/mutating to pivots and non-equi joins, with both `data.table` and the Tidyverse (`dplyr`, `tidyr`, `purrr`, `stringr`)."website: open-graph: description: "Various ways to accomplish most data wrangling operations with both data.table and the Tidyverse" twitter-card: description: "Various ways to accomplish most data wrangling operations with both data.table and the Tidyverse"categories: - "Data Manipulation" - "Tidyverse" - "data.table" - "R"aliases: - /content/posts/data.table/execute: output: asis---{{< include /content/_hr.qmd >}}:::{.callout-tip collapse="true"}# 🆕 Expand for Version History**v1:** 2022-05-19 **v2:** 2022-05-26 - Improved the section on **keys** (for ordering & filtering) - Adding a [section](#tidyr-others) for translations of `Tidyr` (and other similar packages) - Capping tables to display 15 rows max when unfolded - Improving table display (stripping, hiding the contents of nested columns, ...)**v3:** 2022-07-20 - Updating `data.table`'s examples of dynamic programming using [`env`](https://rdatatable.gitlab.io/data.table/articles/datatable-programming.html)- Added new entries in [processing examples](#processing-examples)- Added new entries to [Tidyr & Others](#tidyr-others): expand + complete, transpose/rotation, ... - Added `pivot_wider` examples to match the `dcast` ones in the [Pivots](#pivots) section - Added some new examples here and there across the [Basic Operations](#basic-operations) section - Added an entry for operating inside nested data.frames/data.tables - Added a processing example for run-length encoding (i.e. 
successive event tagging)**v4:** 2022-08-05 - Improved `pivot` section: example of one-hot encoding (and reverse operation) + better examples of partial pivots with `.value`- Added `tidyr::uncount()` (row duplication) example. - Improved both light & dark themes (code highlight, tables, ...) **v5:** 2023-03-12 - Revamped the whole document with grouped tabsets by framework for better readability - Revamped the whole [Basic Operations](#basic-operations) section: better structure, reworked examples, ... - Revamped the whole [Joins](#joins) section: better structure, new examples (e.g. `join_by`), better explanations, ... - Updated code to reflect recent updates of the `Tidyverse`: - `dplyr` (1.1.0): `.by`, `reframe`, `join_by`, `consecutive_id`, ... - `purrr` (1.0.0): `list_rbind`, `list_cbind`, ... - `tidyr` (1.3.0): updated the `separate/separate_rows` section to the newer `separate_wider/longer_*`- Updated code to reflect recent updates of `data.table` (1.14.9): `let`, `DT()`, ... :::```{r}#| echo: false#| eval: false#| output: false# TODO:# URGENT: !!!!!!!!!!!!!!!!!!!!!!# - melt(id.cols = patterns("^[^dist_]"), measure = measure(method, pattern = "dist_(.*)")) # Measure gets everything not in id.cols# - melt(measure = measure(method, pattern = "dist_(.*)", cols = c("dist_sf", "dist_haversine"))) # Manual column names (can't use patterns here)# <=> pivot_longer(cols = starts_with("dist"), names_sep = "_", names_to = c(NA, "method"))# <=> pivot_longer(cols = starts_with("dist"), names_pattern = "dist_(.*)", names_to = c(NA, "method"))## See: ### - https://atrebas.github.io/post/2019-03-03-datatable-dplyr/### - https://atrebas.github.io/post/2020-06-14-datatable-pandas/# ----------------------------------------------------------## TIDYR 1.3:### Unnest_wider/longer## DATA.TABLE 1.14.9:### nafill(x, type = c("const", "locf", "nocb"))#### - locf: last observation carried forward#### - nocb: next observation carried backward## Time methods (eq to lubridate): 
https://rdatatable.gitlab.io/data.table/reference/IDateTime.html## JOINS:### - Add .EACHI example --> by = .EACHI == by = .I for both tables ?!!?#### See: https://scitilab.com/post_data/non_equi_joins/2020_11_17_non_equi_merge/#a-second-real-life-example#### DT <- data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)#### X <- data.table(x=c("c","b"), v=8:7, foo=c(4,2))#### DT[X, on=.(x, v>=v), sum(y)*foo, by=.EACHI]## EXAMPLES:### - frollmean() frollsum() frollapply()### - rollup, cube, groupingsets# ----------------------------------------------------------## dplyr:### - group_map/modify/walk: DAT[, .(data = .(.SD)), by = group][, func(data[[1]]), by = group]### - groups/group_data/group_sizes/group_indices/group_vars/n_groups### - split() vs group_split()### - nest_join()### - with_order()```<!------------------------------------------------------------------------------><!------------------------------------------------------------------------------># Setup {.unnumbered}***```{r}#| echo: false#| output: falsesource(here::here("src", "init_min.R"), echo =FALSE)``````{r}#| echo: false#| eval: falserenv::install(c("here","readr","Rdatatable/data.table", # >= 1.14.9"dplyr", # >= 1.1.0"tidyr", # >= 1.3.0"pipebind","stringr","purrr", # >= 1.0.0"lubridate","broom","nplyr" ))``````{r}#| output: falselibrary(here) # Project managementlibrary(data.table) # Data wrangling (>= 1.14.9)library(dplyr) # Data wrangling (>= 1.1.0)library(tidyr) # Data wrangling (extras) (>= 1.3.0)library(pipebind) # Piping goodieslibrary(stringr) # Manipulating stringslibrary(purrr) # Manipulating lists (>= 1.0.0)library(lubridate) # Manipulating dateslibrary(broom)data.table::setDTthreads(parallel::detectCores(logical =FALSE))```:::{.callout-tip collapse="true"}# 💻 Expand for Session Info```{r}#| echo: false#| results: markupsi <- sessioninfo::session_info(pkgs ="attached")si$platform$Quarto <-system("quarto --version", intern =TRUE)si$platform$pandoc <-strsplit(si$platform$pandoc, 
"@")[[1]][1]si```:::```{r}#| echo: false## This section is for the html output (code-linking, ...)library(knitr)library(quarto)library(downlit)library(xml2)library(withr)``````{css, echo=FALSE}.panel-tabset > .tab-content { display: flex;}.panel-tabset > .tab-content > .tab-pane { display: block !important; visibility: hidden; margin-right: -100%; width: 100%;}.panel-tabset > .tab-content > .active { visibility: visible;}``````{r}#| echo: false#| output: false#| file: !expr here("src", "common", "knitr", "knit_print_gt_mono.R")```<!--------------------------------------------------------><!--------------------------------------------------------># Basic Operations***:::{.callout-tip appearance="simple"}## `data.table` general syntax:DT[`row selector` (filter/sort), `col selector` (select/mutate/summarize/reframe/rename), `modifiers` (group/join by)]:::**Data**```{r}MT <-as.data.table(mtcars)IRIS <-as.data.table(iris)[, Species :=as.character(Species)]```<!-------------------------------------------------------->## Arrange / Order### Basic ordering::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>arrange(desc(cyl))``````{r}mtcars |>arrange(desc(cyl), gear)```##### data.table```{r}MT[order(-cyl)]``````{r}MT[order(-cyl, gear)]``````{r}#| eval: false#| code-fold: trueMT[fsort(cyl, decreasing =TRUE)]setorder(MT, -cyl, gear)[]setorderv(MT, c("cyl", "gear"), c(-1 ,1))[]```**Ordering on a character column**```{r}IRIS[chorder(Species)]```:::### Ordering with keys- Keys physically reorders the dataset within the RAM (by reference) - No memory is used for sorting (other than marking which columns is the key) - The dataset is marked with an attribute _"sorted"_ - The dataset is always sorted in _ascending order_, with _NA_ first - Using `keyby` instead of `by` when grouping will set the grouping factors as keys:::{.callout-tip}See [this SO post](https://stackoverflow.com/questions/20039335/what-is-the-purpose-of-setting-a-key-in-data-table?rq=1) for more 
information on keys.:::```{r}setkey(MT, cyl, gear)setkeyv(MT, c("cyl", "gear"))MT```To see over which keys (if any) the dataset is currently ordered:```{r}haskey(MT)key(MT)```:::{.callout-warning}Unless our task involves repeated subsetting on the same column, the speed gain from key-based subsetting could effectively be nullified by the time needed to reorder the data in RAM, especially for large datasets.:::### Ordering with (secondary) indices- `setindex` creates an index for the provided columns, but doesn’t physically reorder the dataset in RAM. - It computes the ordering vector of the dataset's rows according to the provided columns in an additional attribute called _index_ ```{r}#| echo: falseMT <-as.data.table(mtcars)``````{r}setindex(MT, cyl, gear)setindexv(MT, c("cyl", "gear"))MT```We can see the additional _index_ attribute added to the `data.table`:```{r}#| results: markupnames(attributes(MT))```We can get the currently used indices with:```{r}indices(MT)```Adding a new index doesn't remove a previously existing one:```{r}setindex(MT, hp)indices(MT)```We can thus use indices to pre-compute the ordering for the columns (or combinations of columns) that we will be using to group or subset by frequently !<!-------------------------------------------------------->## Subset / Filter```{r}#| echo: falseMT <-as.data.table(mtcars)IRIS <-as.data.table(iris)[, Species :=as.character(Species)]```### Basic filtering::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>filter(cyl >=6& disp <180)``````{r}iris |>filter(Species %in%c("setosa"))```##### data.table```{r}MT[cyl >=6& disp <180]``````{r}IRIS[Species %chin%c("setosa")]```For non-regex character filtering, use `%chin%` (which is a character-optimized version of `%in%`):::### Filter based on a range```{r}mtcars |>filter(between(disp, 200, 300))``````{r}MT[disp %between%c(200, 300)]```### Filter with a pattern```{r}mtcars |>filter(str_detect(disp, "^\\d{3}\\."))``````{r}MT[disp 
%like%"^\\d{3}\\."]``````{r}#| eval: false#| code-fold: true#| code-summary: VariantsIRIS[Species %flike%"set"] # Fixed (not regex)IRIS[Species %ilike%"Set"] # Ignore caseIRIS[Species %plike%"(?=set)"] # Perl-like regex```### Filter on row number (slicing)::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>slice(1) # slice_head(n = 1)``````{r}mtcars |>slice(n()) # slice_tail(n = 1)```***Slice a random sample of rows:```{r}mtcars |>slice_sample(n =5)```##### data.table```{r}MT[1]``````{r}MT[.N]```***Slice a random sample of rows:```{r}MT[sample(.N, 5)]```:::### Filter distinct/unique rows::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>distinct(mpg, hp, .keep_all =TRUE)```*****Number of unique rows/values**```{r}n_distinct(mtcars$gear)```##### data.table```{r}unique(MT, by =c("mpg", "hp")) # cols = other_cols_to_keep```*****Number of unique rows/values**```{r}uniqueN(MT, by ="gear")```:::### Filter by keysWhen keys or indices are defined, we can filter based on them, which is often a lot faster. :::{.callout-tip}We do not even need to specify the column name we are filtering on: the values will be attributed to the keys in order.:::```{r}setkey(MT, cyl)MT[.(6)] # Equivalent to MT[cyl == 6]``````{r}setkey(MT, cyl, gear)MT[.(6, 4)] # Equivalent to MT[cyl == 6 & gear == 4]```### Filter by indicesTo filter by indices, we can use the `on` argument, which creates a **temporary secondary index** on the fly (if it doesn't already exist).```{r}IRIS["setosa", on ="Species"]```Since the time to compute the secondary indices is quite small, we don’t have to use `setindex`, unless the task involves repeated subsetting on the same columns.:::{.callout-tip}When using `on` with multiple values, the `nomatch = NULL` argument avoids creating combinations that do not exist in the original data (i.e. 
for `cyl == 5` here):::```{r}MT[.(4:6, 4), on =c("cyl", "gear"), nomatch =NULL]```### Filtering on multiple columns**Filtering with one function taking multiple columns:**```{r}f_dat <- \(d) with(d, gear > cyl) # Function taking the data and comparing fix columnsf_dyn <- \(x, y) x > y # Function taking dynamic columns and comparing them``````{r}cols <-c("gear", "cyl")```::: {.panel-tabset group="framework"}##### Tidyverse**Manually:**```{r}mtcars |>filter(f_dyn(gear, cyl))```*****Dynamically:**Taking column names:```{r}mtcars |>filter(f_dyn(!!!syms(cols)))```Taking the data:```{r}mtcars |>filter(f_dat(cur_data()))```##### data.table**Manually:**```{r}MT[f_dyn(gear, cyl),]```*****Dynamically:**Taking column names:```{r}MT[do.call(f_dyn, args), env =list(args =as.list(cols))] # exec(f_dyn, !!!args)```Taking the data:```{r}MT[f_dat(MT),] # Can't use .SD in i```_In two steps:_:::{.callout-note appearance="simple"}## We can't use `.SD` in the `i` clause of a `data.table`But we can bypass that constraint by doing the operation in two steps: - Obtaining a vector stating if each row of the table matches or not the conditions - Filtering the original table based on the vector:::```{r}MT[MT[, f_dat(.SD)]]```:::**Combining multiple filtering functions:**This function filters rows that have 2 or more non-zero decimals, and we're going to call it on multiple columns:```{r}decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) >=2``````{r}cols <-c("drat", "wt", "qsec")```::: {.panel-tabset group="framework"}##### Tidyverse**Manually:**```{r}mtcars |>filter(decp(drat) &decp(wt) &decp(qsec))```*****Dynamically:**```{r}mtcars |>filter(if_all(cols, decp))```##### data.table**Manually:**```{r}MT[decp(drat) &decp(wt) &decp(qsec), ]```*****Dynamically:**```{r}MT[Reduce(`&`, lapply(mget(cols), decp)), ]``````{r}#| eval: false#| code-fold: trueMT[Reduce(`&`, lapply(MT[, ..cols], decp)), ]MT[Reduce(`&`, lapply(v1, decp)), env =list(v1 =as.list(cols))]```_In two 
steps:_```{r}MT[MT[, Reduce(`&`, lapply(.SD, decp)), .SDcols = cols]]```:::<!-------------------------------------------------------->## Rename```{r}#| echo: falseMT <-as.data.table(mtcars)IRIS <-as.data.table(iris)[, Species :=as.character(Species)]```:::{.callout-note}`setnames` changes column names **in-place**:::::: {.panel-tabset group="framework"}##### Tidyverse**Manually:**```{r}mtcars |>rename(CYL = cyl, MPG = mpg)```*****Dynamically:**```{r}mtcars |>rename_with(\(c) toupper(c), .cols =matches("^d"))```##### data.table**Manually:**```{r}setnames(copy(MT), c("cyl", "mpg"), c("CYL", "MPG"))[]```*****Dynamically:**```{r}setnames(copy(MT), grep("^d", colnames(MT)), toupper)[]```:::<!-------------------------------------------------------->## Select```{r}#| echo: falseMT <-as.data.table(mtcars)IRIS <-as.data.table(iris)[, Species :=as.character(Species)]```### Basic selection::: {.panel-tabset group="framework"}##### Tidyverse```{r}MT |>select(matches("cyl|disp"))```<br>*****Remove a column:**```{r}mtcars |>select(!cyl) # select(-cyl)```##### data.table```{r}MT[, .(mpg, disp)]``````{r}#| eval: false#| code-fold: trueMT[ , .SD, .SDcols =c("mpg", "disp")]MT[, .SD, .SDcols =patterns("mpg|disp")]```*****Remove a column:**```{r}MT[, !"cyl"] # MT[, -"cyl"]```In-place:```{r}copy(MT)[, cyl :=NULL][]```:::::: {.panel-tabset group="framework"}##### Tidyverse**Select & Extract:**```{r}#| results: markupmtcars |>pull(disp)```*****Select & Rename:**```{r}mtcars |>select(dispp = disp)```##### data.table**Select & Extract:**```{r}#| results: markupMT[, disp]```*****Select & Rename:**```{r}MT[, .(dispp = disp)]```:::### Dynamic selection### By name:```{r}cols <-c("cyl", "disp")```::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>select(all_of(cols)) # select(!!cols)```<br>*****Removing a column:**```{r}mtcars |>select(!{{cols}}) # select(-matches(cols))```##### data.table```{r}MT[, ..cols]``````{r}#| eval: false#| code-fold: trueMT[, mget(cols)] # RetiredMT[, 
cols, with =FALSE] # RetiredMT[, .SD, .SDcols = cols]MT[, j, env =list(j =as.list(cols))]```*****Removing a column:**```{r}MT[, !..cols]``````{r}#| eval: false#| code-fold: trueMT[, .SD, .SDcols =!cols]MT[, -j, env =list(j =I(cols))]```_In-place:_```{r}copy(MT)[, (cols) :=NULL][]```:::#### By pattern:::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>select(-matches("^d"))``````{r}mtcars |>select(where(\(x) all(x !=0))) # Only keep columns where no value == 0```##### data.table```{r}MT[, .SD, .SDcols =!patterns("^d")]``````{r}MT[, .SD, .SDcols = \(x) all(x !=0)] # Only keep columns where no value == 0``````{r}#| eval: false#| code-fold: showcopy(MT)[, grep("^d", colnames(MT)) :=NULL][] # In place (column deletion)MT[, MT[, sapply(.SD, \(x) all(x !=0))], with =FALSE]```:::#### By column type:```{r}iris |>select(where(\(x) !is.numeric(x)))``````{r}IRIS[, .SD, .SDcols =!is.numeric]```<!-------------------------------------------------------->## Mutate / Transmute```{r}#| echo: falseMT <-as.data.table(mtcars)IRIS <-as.data.table(iris)[, Species :=as.character(Species)]```**`data.table` can mutate in 2 ways:** - Using `=` creates a new DT with the new columns only (like `dplyr::transmute`) - Using `:=` (or `let`) modifies the current dt *in place* (like `dplyr::mutate`)The function modifying a column should be the same size as the original column (or group). If only one value is provided with `:=`, it will be recycled to the whole column/group.If the number of values provided is smaller than the original column/group: - With `:=` or `let`, an error will be raised, asking to manually specify how to recycle the values. 
- With `=`, it will behave like `dplyr::summarize` (if a grouping has been specified).### Basic transmuteOnly keeping the transformed columns.::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>transmute(cyl = cyl *2)```##### data.table```{r}MT[, .(cyl = cyl *2)]```*****Transmute & Extract:**```{r}#| results: markupMT[, (cyl = cyl *2)]```:::### Basic mutateModifies the transformed column **in-place** and keeps every other column as-is.::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>mutate(cyl =200)mtcars |>mutate(cyl =200, gear =5)```<br>```{r}mtcars |>mutate(mean_cyl =mean(cyl, na.rm =TRUE))``````{r}mtcars |>mutate(gear_plus =lead(gear))```##### data.table```{r}copy(MT)[, cyl :=200][]copy(MT)[, let(cyl =200, gear =5)][]``````{r}#| eval: false#| code-fold: truecopy(MT)[, `:=`(cyl =200, gear =5)][]copy(MT)[, c("cyl", "gear") := .(200, 5)][]``````{r}copy(MT)[, mean_cyl :=mean(cyl, na.rm =TRUE)][]``````{r}copy(MT)[, gearplus :=shift(gear, 1, type ="lead")][] # lead, lag, cyclic```:::### Dynamic trans/mutate```{r}LHS <-"mean_mpg"RHS <-"mpg"```::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>mutate({{LHS}} :=mean(mpg))``````{r}mtcars |>mutate("{LHS}":=mean(.data[[RHS]]))mtcars |>mutate({{LHS}} :=cur_data()[[RHS]] |>mean())mtcars |>mutate({{LHS}} :=pick({{ RHS }}) |>unlist() |>mean())```##### data.table```{r}copy(MT)[, (LHS) :=mean(mpg)][] # (LHS) <=> c(LHS)copy(MT)[, j :=mean(mpg), env =list(j = LHS)][]``````{r}copy(MT)[, c(LHS) :=mean(get(RHS))][]copy(MT)[, x :=mean(y), env =list(x = LHS, y = RHS)][]```:::### Conditional trans/mutate::: {.panel-tabset group="framework"}##### Tidyverse**Mutate everything based on multiple conditions:**One condition:```{r}mtcars |>mutate(Size =if_else(cyl >=6, "BIG", "small", missing ="Unk"))```Nested conditions:```{r}mtcars |>mutate(Size =case_when( cyl %between%c(2,4) ~"small", cyl %between%c(4,8) ~"BIG",.default ="Unk"))```*****Mutate only rows meeting conditions:**```{r}mtcars 
|>mutate(BIG =case_when(am ==1~ cyl >=6))```##### data.table**Mutate everything based on multiple conditions:**One condition:```{r}copy(MT)[, Size :=fifelse(cyl >=6, "BIG", "small", na ="Unk")][]```Nested conditions:```{r}copy(MT)[, Size :=fcase( cyl %between%c(2,4), "small", cyl %between%c(4,8), "BIG",default ="Unk")][]```*****Mutate only rows meeting conditions:**```{r}copy(MT)[am ==1, BIG := cyl >=6][]```:::### Complex trans/mutate#### Column-wise operations```{r}new <-c("min_mpg", "min_disp")old <-c("mpg", "disp")```**Apply one function to multiple columns:**::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>mutate(across(c("mpg", "disp"), min, .names ="min_{col}"))```<br><br>*****As a `transmute`:**```{r}mtcars |>transmute(across(c("mpg", "disp"), min, .names ="min_{col}"))```<br>*****Dynamically:**```{r}mtcars |>mutate(across(all_of(old), min, .names ="min_{col}"))```##### data.table```{r}copy(MT)[ , c("min_mpg", "min_disp") :=lapply(.SD, min), .SDcols =c("mpg", "disp") ][]``````{r}#| eval: falsecopy(MT)[, c("min_mpg", "min_disp") :=lapply(.(mpg, disp), min)][]```*****As a `transmute`:**A second step is needed to add `min_` before the names:```{r}(MT[, lapply(.SD[, .(mpg, disp)], min)] |>bind(d, setnames(d, names(d), \(x) paste0("min_", x))))[]```*****Dynamically:**```{r}copy(MT)[, c(new) :=lapply(mget(old), min)][]``````{r}#| eval: falsecopy(MT)[, c(new) :=lapply(x, min), env =list(x =as.list(old))][]```:::**Apply multiple functions to one or multiple columns:**```{r}col <-"mpg"cols <-c("mpg", "disp")```::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>mutate(min_mpg =min(mpg), max_mpg =max(mpg))mtcars |>mutate(across(mpg, list(min = min, max = max), .names ="{fn}_{col}"))```<br>*****Multiple columns:**```{r}mtcars |>mutate(across(matches("mpg|disp"), list(min = min, max = max), .names ="{fn}_{col}"))``````{r}mtcars |>mutate(across(cols, list(min = \(x) min(x), max = \(x) max(x)), .names ="{fn}_{col}"))```##### 
data.table```{r}copy(MT)[, let(min_mpg =min(mpg), max_mpg =max(mpg))][]copy(MT)[, c("min_mpg", "max_mpg") := .(min(mpg), max(mpg))][]``````{r}#| eval: false#| code-fold: truecopy(MT)[, c("min_mpg", "max_mpg") :=lapply(.(mpg), \(x) list(min(x), max(x))) |>do.call(rbind, args = _) ][]copy(MT)[, c("min_mpg", "max_mpg") :=lapply(.(get(col)), \(x) list(min(x), max(x))) |>unlist(recursive =FALSE) ][]```*****Multiple columns:**```{r}copy(MT)[, c("min_mpg", "min_disp", "max_mpg", "max_disp") :=lapply(.SD, \(x) list(min(x), max(x))) |>do.call(rbind, args = _), .SDcols = cols][]``````{r}copy(MT)[, outer(c("min", "max"), cols, str_c, sep ="_") |>t() |>as.vector() :=lapply(.SD, \(x) list(min(x), max(x))) |>do.call(rbind, args = _), .SDcols = cols][]```:::#### Row-wise operations**Apply one function to multiple columns (row-wise):**::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>rowwise() |>mutate(rsum =sum(c_across(where(is.numeric)))) |>ungroup()mtcars |>mutate(rsum =pmap_dbl(across(where(is.numeric)), \(...) sum(c(...))))```Hybrid base R-Tidyverse:```{r}mtcars |>mutate(rsum =apply(across(where(is.numeric)), 1, sum))mtcars |>mutate(rsum =rowSums(across(where(is.numeric))))```##### data.table```{r}copy(MT)[, rsum :=rowSums(.SD), .SDcols = is.numeric][]copy(MT)[, rsum :=apply(.SD, 1, sum), .SDcols = is.numeric][]```:::**Apply multiple functions to multiple columns (row-wise)**::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>mutate(pmap_dfr(across(where(is.numeric)), \(...) list(mean =mean(c(...)), sum =sum(c(...)))))``````{r}#| eval: false#| code-fold: truemtcars |>mutate(pmap(across(where(is.numeric)), \(...) 
list(mean =mean(c(...)), sum =sum(c(...)))) |>bind_rows() )```Hybrid base R-Tidyverse:```{r}mtcars |>mutate(apply(across(where(is.numeric)), 1, \(x) list(mean =mean(x), sum =sum(x))) |>bind_rows())```##### data.table```{r}copy(MT)[, c("rmean", "rsum") :=apply(.SD, 1, \(x) list(mean(x), sum(x))) |>rbindlist(), .SDcols = is.numeric][]```:::**Apply an anonymous function inside the DT:**```{r}MT[, {print(summary(mpg)) x <- cyl + gear .(RN =1:.N, CG = x) }]```<!-------------------------------------------------------->## Group / Aggregate```{r}#| echo: falseMT <-as.data.table(mtcars)IRIS <-as.data.table(iris)[, Species :=as.character(Species)]```:::{.callout-note}The examples listed apply a grouping but do nothing (using `.SD` to simply keep all columns as is):::```{r}cols <-c("cyl", "disp")cols_missing <-c("cyl", "disp", "missing_col")```### Basic grouping::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>group_by(cyl, gear)```*****Dynamic grouping:**```{r}mtcars |>group_by(across(all_of(cols)))```Use `any_of` if you expect some columns to be missing in the data.```{r}mtcars |>group_by(across(any_of(cols_missing)))```##### data.table```{r}MT[, .SD, by = .(cyl, gear)]```*****Dynamic grouping:**```{r}MT[, .SD, by = cols]```To handle potentially missing columns:```{r}MT[, .SD, by =intersect(cols_missing, colnames(MT))]```:::### Current group info::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>group_by(cyl) |>filter(cur_group_id() ==1) |># To only keep one plotgroup_walk(\(d, g) with(d, plot(hp, mpg, main =paste("Cyl:", g$cyl))))```##### data.tableUse the `.BY` argument to get the current group name:```{r}#| eval: falseMT[, with(.SD, plot(hp, mpg, main =paste("Cyl:", .BY))), keyby = cyl]```:::<!-------------------------------------------------------->## Row numbers & indices```{r}#| echo: falseMT <-as.data.table(mtcars)IRIS <-as.data.table(iris)[, Species :=as.character(Species)]```### Adding row or group indices`.I`: Row indices `.N`: 
Number of rows `.GRP`: Group indices `.NGRP`: Number of groups #### Adding row indices:```{r}mtcars |>mutate(I =row_number())copy(MT)[ , I := .I][]```#### Adding group indices:::: {.panel-tabset group="framework"}##### Tidyverse**Adding group indices (same index for each group):**```{r}mtcars |>summarize(GRP =cur_group_id(), .by = cyl)```Mutate instead of summarize:```{r}mtcars |>mutate(GRP =cur_group_id(), .by = cyl)```*****Adding row numbers within each group:**```{r}mtcars |>mutate(I_GRP =row_number(), .by = gear)```##### data.table**Adding group indices (same index for each group):**```{r}MT[, .GRP, by = cyl]```Mutate instead of summarize:```{r}copy(MT)[, GRP := .GRP, by = cyl][]```*****Adding row numbers within each group:**```{r}copy(MT)[, I_GRP :=1:.N, by = gear][]copy(MT)[, I_GRP :=rowid(gear)][]```:::### Filtering based on row numbers (slicing)#### Extracting a specific row:::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |> dplyr::first()mtcars |> dplyr::last()mtcars |> dplyr::nth(5)```##### data.table```{r}MT[1,] # data.table::first(MT)MT[.N,] # data.table::last(MT)MT[5,]```:::#### Slicing rows:::: {.panel-tabset group="framework"}##### Tidyverse```{r}tail(mtcars, 10)mtcars |>slice((n()-9):n())mtcars |>slice_tail(n =10)```##### data.table```{r}tail(MT, 10)MT[(.N-9):.N]MT[MT[, .I[(.N-9):.N]]] # Gets the last 10 rows' indices and filters based on them```:::#### Slicing groups:::: {.panel-tabset group="framework"}##### Tidyverse**Random sample by group:**```{r}mtcars |>slice_sample(n =5, by = cyl)```*****Filter groups by condition:**```{r}mtcars |>filter(n() >=8, .by = cyl)mtcars |>group_by(cyl) |>group_modify(\(d,g) if (nrow(d) >=8) d else data.frame())```##### data.table**Random sample by group:**```{r}MT[, .SD[sample(.N, 5)], keyby = cyl]```*****Filter groups by condition:**```{r}MT[, if(.N >=8) .SD, by = cyl]MT[, .SD[.N >=8], by = cyl]```:::### Extracting row indices#### Getting the row numbers of specific observations:::: {.panel-tabset 
group="framework"}##### TidyverseRow number of the first and last observation of each group:```{r}mtcars |>reframe(I =cur_group_rows()[c(1, n())], .by = cyl)```... while keeping all other columns:```{r}mtcars |>mutate(I =row_number()) |>slice(c(1, n()), .by = cyl)```##### data.tableRow number of the first and last observation of each group:```{r}MT[, .I[c(1, .N)], by = cyl]```... while keeping all other columns:```{r}copy(MT)[, I := .I][, .SD[c(1, .N)], by = cyl]```:::#### Extracting row indices after filtering:::: {.panel-tabset group="framework"}##### Tidyverse<br>Extracting row numbers in the original dataset:```{r}mtcars |>mutate(I =row_number()) |>filter(gear ==4) |>pull(I)```Extracting row numbers in the new dataset (after filtering):```{r}mtcars |>filter(gear ==4) |>mutate(I =row_number()) |>pull(I)```##### data.table:::{.callout-warning}`.I` gives the vector of row numbers *after* any subsetting/filtering has been done:::Extracting row numbers in the original dataset:```{r}MT[, .I[gear ==4]]```Extracting row numbers in the new dataset (after filtering):```{r}MT[gear ==4, .I]```:::<!-------------------------------------------------------->## Relocate### Basic reordering::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>relocate(cyl, .after =last_col())```<br>Relocate a new column (mutate + relocate):```{r}mtcars |>mutate(GRP =cur_group_id(), .by = cyl, .before =1)```##### data.table```{r}setcolorder(copy(MT), "cyl", after =last(colnames(MT)))[]setcolorder(copy(MT), c(setdiff(colnames(MT), "cyl"), "cyl"))[]```Relocate a new column (mutate + relocate):```{r}setcolorder(copy(MT)[ , GRP := .GRP, by = cyl], "GRP")[]```:::### Reordering by column names::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>select(sort(tidyselect::peek_vars()))``````{r}mtcars |>select(carb, sort(tidyselect::peek_vars()))```##### data.table```{r}setcolorder(copy(MT), sort(colnames(MT)))[]``````{r}setcolorder(copy(MT), c("carb", sort(setdiff(colnames(MT), 
"carb"))))[]```:::<!-------------------------------------------------------->## Summarize/Reframe:```{r}#| echo: falseMT <-as.data.table(mtcars)IRIS <-as.data.table(iris)[, Species :=as.character(Species)]```With `data.table`, one needs to use the `=` operator to summarize. It takes a **function that returns a list of values smaller than the original column** (or group) size. By default, it will **only keep the modified columns** (like a `transmute`).### Basic summary```{r}mtcars |>summarize(mean_cyl =mean(cyl))``````{r}MT[, .(mean_cyl =mean(cyl))]```### Grouped summary::: {.panel-tabset group="framework"}##### TidyverseBy default, `dplyr::summarize` will `arrange` the result by the grouping factor:```{r}mtcars |>summarize(N =n(), .by = cyl)```To order by the grouping factor, use `group_by()` instead of `.by`:```{r}mtcars |>group_by(cyl) |>summarize(N =n())```##### data.tableBy default, `data.table` keeps the order the groups originally appear in:```{r}MT[, .N, by = cyl]```To order by the grouping factor, use `keyby` instead of `by`:```{r}MT[, .N, keyby = cyl]```:::**Grouped on a temporary variable:**```{r}mtcars |>group_by(cyl >6) |>summarize(N =n())``````{r}MT[, .N, by = .(cyl >6)]```### Column-wise summary#### Apply one function to multiple columns:::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>summarize(across(everything(), mean), .by = cyl)```By column type:```{r}mtcars |>summarize(across(where(is.double), mean), .by = cyl)```By matching column names:```{r}mtcars |>summarize(across(matches("^d"), mean), .by = cyl)```##### data.table```{r}MT[, lapply(.SD, mean), by = cyl]```By column type:```{r}MT[, lapply(.SD[, -"cyl"], mean), by = cyl, .SDcols = is.double]```By matching column names:```{r}MT[, lapply(.SD, mean), by = cyl, .SDcols =patterns("^d")]```:::#### Applying multiple functions to one column:::: {.panel-tabset group="framework"}##### Tidyverse```{r}mtcars |>summarize(mean(mpg), sd(mpg), .by = cyl)```<br><br>With column 
names:```{r}mtcars |>summarize(mean =mean(mpg), sd =sd(mpg), .by = cyl)``````{r}mtcars |>summarize(across(mpg, list(mean = mean, sd = sd), .names ="{fn}"), .by = cyl)```##### data.table```{r}MT[, .(mean(mpg), sd(mpg)), by = cyl]``````{r}MT[, lapply(.(mpg), \(x) list(mean(x), sd(x))) |>rbindlist(), by = cyl]```With column names:```{r}MT[, .(mean =mean(mpg), sd =sd(mpg)), by = cyl]``````{r}MT[, lapply(.SD, \(x) list(mean =mean(x), sd =sd(x))) |>rbindlist(), by = cyl, .SDcols ="mpg"]```:::#### Apply multiple functions to multiple columns::::{.callout-note}Depending on the output we want (i.e. having the function's output as columns or rows), we can either provide a list of functions to apply (`list_of_fns`), or a function returning a list (`fn_returning_list`).:::```{r}cols <-c("mpg", "hp")list_of_fns <-list(mean = \(x) mean(x), sd = \(x) sd(x))fn_returning_list <- \(x) list(mean =mean(x), sd =sd(x))```::: {.panel-tabset group="framework"}##### Tidyverse**One column per function, one row per variable:**```{r}reframe(mtcars, map_dfr(pick(all_of(cols)), fn_returning_list, .id ="Var"), .by = cyl)``````{r}#| eval: false#| code-fold: truereframe(mtcars, map(pick(all_of(cols)), fn_returning_list) |>bind_rows(.id ="Var"), .by = cyl)```*****One column per variable, one row per function:**```{r}reframe(mtcars, map_dfr(list_of_fns, \(f) map(pick(all_of(cols)), f), .id ="Fn"), .by = cyl)```*****One column per function/variable combination:**```{r}summarize(mtcars, across(all_of(cols), list_of_fns, .names ="{col}.{fn}"), .by = cyl)```##### data.table**One column per function, one row per variable:**```{r}MT[, lapply(.SD, fn_returning_list) |>rbindlist(idcol ="Var"), by = cyl, .SDcols = cols]```<br>*****One column per variable, one row per function:**```{r}MT[, lapply(list_of_fns, \(f) lapply(.SD, f)) |>rbindlist(idcol ="Fn"), by = cyl, .SDcols = cols]```*****One column per function/variable combination:**```{r}MT[, lapply(.SD, fn_returning_list) |>unlist(recursive =FALSE), by = 
cyl, .SDcols = cols]``````{r}#| echo: false#| code-fold: trueMT[, lapply(.SD, fn_returning_list) |>do.call(c, args = _), by = cyl, .SDcols = cols]```Different column order & naming scheme:```{r}MT[, lapply(list_of_fns, \(f) lapply(.SD, f)) |>unlist(recursive =FALSE), by = cyl, .SDcols = cols]```Using `dcast` (see next section for more on pivots):```{r}dcast(MT, cyl ~ ., fun.agg = list_of_fns, value.var = cols) # list(mean, sd)```:::<!--------------------------------------------------------><!--------------------------------------------------------># Pivots***<!-------------------------------------------------------->## Melt / Longer**Data:**```{r}#| echo: false#| output: falseFAM1 <-fread("family_id age_mother dob_child1 dob_child2 dob_child31 30 1998-11-26 2000-01-29 NA2 27 1996-06-22 NA NA3 26 2002-07-11 2004-04-05 2007-09-024 32 2004-10-10 2009-08-27 2012-07-215 29 2000-12-05 2005-02-28 NA")FAM2 <-fread("family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child31 30 1998-11-26 2000-01-29 NA 1 2 NA2 27 1996-06-22 NA NA 2 NA NA3 26 2002-07-11 2004-04-05 2007-09-02 2 2 14 32 2004-10-10 2009-08-27 2012-07-21 1 1 15 29 2000-12-05 2005-02-28 NA 2 1 NA")``````{r}(fam1 <-as.data.frame(FAM1))(fam2 <-as.data.frame(FAM2))```### Basic Melt/Longer:::: {.callout-tip}`data.table::melt` does partial argument matching and thus accepts shortened versions of its arguments. 
E.g.: `variable.name` <=> `variable` (or `var`), `value.name` <=> `value` (or `val`), `measure.vars` <=> `measure`, `id.vars` <=> `id`, `pattern` <=> `pat`, ...::::**One group of columns --> single value column**::: {.panel-tabset group="framework"}##### Tidyverse```{r}pivot_longer(FAM1, cols =matches("dob_"), names_to ="variable")```##### data.table```{r}melt(FAM1, measure.vars =c("dob_child1", "dob_child2", "dob_child3"))melt(FAM1, measure =patterns("^dob_"))```:::**One group of columns --> multiple value columns**::: {.panel-tabset group="framework"}##### Tidyverse```{r}# No direct equivalent```##### data.table```{r}melt(FAM1, measure =patterns(child1 ="child1$", child2 ="child2$|child3$"))```:::### Merging multiple yes/no columnsMelting multiple presence/absence columns into a single variable:**Data:**```{r}#| echo: false#| output: falsemovies_wide <-tibble(ID =1:3,action =c(1,1,1),adventure =c(0,1,1),animation =c(0,0,1))``````{r}(MOVIES_WIDE <-as.data.table(movies_wide))```::: {.panel-tabset group="framework"}##### Tidyverse```{r}pivot_longer( movies_wide, -ID, names_to ="Genre", values_transform = \(x) ifelse(x ==0, NA, x), values_drop_na =TRUE ) |>select(-value)```##### data.table```{r}melt(MOVIES_WIDE, id.vars ="ID", var ="Genre")[value !=0][order(ID), -"value"]```:::### Partial pivotMultiple groups of columns --> Multiple value columns::: {.panel-tabset group="framework"}##### Tidyverse**Using `.value`:**:::{.callout-tip}Using the `.value` special identifier allows to do a "half" pivot: the values that would be listed as rows under `.value` are instead used as columns.:::```{r}pivot_longer(fam2, matches("^dob|^gender"), names_to =c(".value", "child"), names_sep ="_child")```##### data.table**Using `.value`:**```{r}melt(FAM2, measure =patterns("^dob", "^gender"), val =c("dob", "gender"), var ="child")```*****Manually:**```{r}colA <-str_subset(colnames(FAM2), "^dob")colB <-str_subset(colnames(FAM2), "^gender")melt(FAM2, measure =list(colA, colB), val 
=c("dob", "gender"), var ="child")``````{r}#| eval: false#| code-fold: truemelt(FAM2, measure =list(a, b), val =c("dob", "gender"), var ="child") |>substitute2(env =list(a =I(str_subset(colnames(FAM2), "^dob")), b =I(str_subset(colnames(FAM2), "^gender")))) |>eval()```*****Using `measure` and `value.name`:**```{r}melt(FAM2, measure =measure(value.name, child = \(x) as.integer(x), sep ="_child"))``````{r}#| eval: false#| code-fold: truemelt(FAM2, measure =measurev(list(value.name =NULL, child = as.integer), pat ="(.*)_child(\\d)"))```:::<!-------------------------------------------------------->## Dcast / Wider**General idea:** - Pivot around the combination of `id.vars` (LHS of the formula) - The `measure.vars` (RHS of the formula) are the ones whose values become column names - The `value.var` are the ones the values are taken from to fill the new columns**Data:**```{r}#| echo: falseFAM1L <-melt(FAM1, measure =c("dob_child1", "dob_child2", "dob_child3"))FAM2L <-melt(FAM2, measure =measure(value.name, child = \(.x) as.integer(.x), sep ="_child"))``````{r}(fam1l <-as.data.frame(FAM1L))(fam2l <-as.data.frame(FAM2L))```### Basic Dcast/Wider::: {.panel-tabset group="framework"}##### Tidyverse```{r}pivot_wider(fam1l, id_cols =c("family_id", "age_mother"), names_from ="variable")```##### data.table```{r}dcast(FAM1L, family_id + age_mother ~ variable)```:::**Using all the columns as IDs:**::: {.panel-tabset group="framework"}##### Tidyverse```{r}pivot_wider(fam1l, names_from = variable)```:::: {.callout-note}By default, `id_cols = everything()`::::##### data.table```{r}FAM1L |>dcast(... 
~ variable)```:::: {.callout-note}`...` <=> "every unused column":::::::**Multiple value columns --> Multiple groups of columns:**::: {.panel-tabset group="framework"}##### Tidyverse```{r}pivot_wider( fam2l, id_cols =c("family_id", "age_mother"), values_from =c("dob", "gender"), names_from ="child", names_sep ="_child")```##### data.table```{r}dcast(FAM2L, family_id + age_mother ~ child, value.var =c("dob", "gender"), sep ="_child")dcast(FAM2L, ... ~ child, value.var =c("dob", "gender"), sep ="_child")```:::**Dynamic names in the formula:**```{r}var_name <-"variable"id_vars <-c("family_id", "age_mother")```::: {.panel-tabset group="framework"}##### Tidyverse```{r}pivot_wider(fam1l, id_cols =c(family_id, age_mother), names_from = {{ var_name }})```<br>Multiple dynamic names:```{r}pivot_wider(fam1l, id_cols =all_of(id_vars), names_from = variable)```<br>##### data.table```{r}dcast(FAM1L, family_id + age_mother ~ base::get(var_name))dcast(FAM1L, family_id + age_mother ~ x) |>substitute2(env =list(x = var_name)) |>eval()```Multiple dynamic names:```{r}dcast(FAM1L, str_c(str_c(id_vars, collapse =" + "), " ~ variable"))dcast(FAM1L, x + y ~ variable) |>substitute2(env =list(x = id_vars[1], y = id_vars[2])) |>eval()```:::### Renaming (prefix/suffix) the columns::: {.panel-tabset group="framework"}##### Tidyverse```{r}pivot_wider(fam1l, names_from = variable, values_from = value, names_prefix ="Attr: ")pivot_wider(fam1l, names_from = variable, values_from = value, names_glue ="Attr: {variable}")```##### data.table```{r}dcast(FAM1L, family_id + age_mother ~paste0("Attr: ", variable))```:::### Unused combinations:::{.callout-warning}The logic is inverted between `dplyr` (keep) and `data.table` (drop)::::::: {.panel-tabset group="framework"}##### Tidyverse```{r}pivot_wider(fam1l, names_from = variable, values_from = value, id_expand = T, names_expand = F)```##### data.table```{r}dcast(FAM1L, family_id + age_mother ~ variable, drop =c(FALSE, TRUE)) # (drop_LHS, 
drop_RHS)```:::### Subsetting::: {.panel-tabset group="framework"}##### Tidyverse```{r}fam1l |>filter(value >= lubridate::ymd(20030101)) |>pivot_wider(id_cols =c("family_id", "age_mother"), names_from ="variable")```:::{.callout-warning}AFAIK, `pivot_wider` can't do this on its own.:::##### data.table```{r}dcast(FAM1L, family_id + age_mother ~ variable, subset = .(value >= lubridate::ymd(20030101)))```:::### AggregatingIn `data.table`, not specifying the column holding the measure vars (the names) will result in an empty column counting the number of columns that should have been created for all the measures (i.e. the `length()` of the result).::: {.panel-tabset group="framework"}##### Tidyverse```{r}(pivot_wider(fam1l, id_cols =c(family_id, age_mother), names_from = variable, values_fn = length)|>mutate(length =apply(pick(matches("_child")), 1, \(x) sum(x))) |>select(-matches("^dob_")))```##### data.table```{r}dcast(FAM1L, family_id + age_mother ~ .)```:::Customizing the default behavior (`length()`) using the `fun.aggregate` (<=> `fun.agg` or `fun`) argument:*Here, we count the number of children for each combination of (family_id + age_mother) -> sum of all non-NA `value`*::: {.panel-tabset group="framework"}##### Tidyverse```{r}(pivot_wider( fam1l, id_cols =c(family_id, age_mother), names_from = variable, values_fn = \(x) !is.na(x) ) |>mutate(child_count =apply(pick(matches("_child")), 1, \(x) sum(x)))|>select(-matches("^dob_")))``````{r}#| eval: false#| code-fold: true(pivot_wider(fam1l, id_cols =c(family_id, age_mother), names_from = variable, values_fn = \(x) !is.na(x))|>mutate(child_count =pmap_int(pick(matches("_child")), \(...) 
sum(...)))|>select(-matches("^dob_")))(pivot_wider(fam1l, id_cols =c(family_id, age_mother), names_from = variable, values_fn = \(x) !is.na(x))|>rowwise()|>mutate(child_count =sum(c_across(matches("_child"))))|>ungroup()|>select(-matches("^dob_")))```##### data.table```{r}(dcast(FAM1L, family_id + age_mother ~ ., fun = \(x) sum(!is.na(x))) |>setnames(".", "child_count"))```:::**Applying multiple `fun.agg`:**Data:```{r}(DTL <-data.table(id1 =sample(5, 20, TRUE), id2 =sample(2, 20, TRUE), group =sample(letters[1:2], 20, TRUE), v1 =runif(20), v2 =1L ))```::: {.panel-tabset group="framework"}##### Tidyverse* Multiple aggregation functions applied to one variable:```{r}(pivot_wider( DTL, id_cols =c("id1", "id2"), names_from ="group", values_from ="v1",names_glue ="{.value}_{.name}", names_vary ="slowest", names_sort =TRUE,values_fn = \(x) tibble("sum"=sum(x), "mean"=mean(x)) ) |>unnest(cols =starts_with("v1"), names_sep ="_"))```**** Multiple aggregation functions applied to multiple variables (all combinations):```{r}(DTL |>pivot_wider(id_cols =c("id1", "id2"), names_from ="group", names_vary ="slowest", names_sort =TRUE,values_from =c("v1", "v2"), values_fn = \(x) tibble("sum"=sum(x), "mean"=mean(x)) ) |>unnest(cols =matches("^v1|^v2"), names_sep ="_"))```**** Multiple aggregation functions applied to multiple variables (one-to-one):```{r}# Not possible with pivot_wider AFAIK```##### data.table* Multiple aggregation functions applied to one variable:```{r}dcast(DTL, id1 + id2 ~ group, fun =list(sum, mean), value.var ="v1")```<br><br>**** Multiple aggregation functions applied to multiple variables (all combinations):```{r}dcast(DTL, id1 + id2 ~ group, fun =list(sum, mean), value.var =c("v1", "v2"))```<br><br>**** Multiple aggregation functions applied to multiple variables (one-to-one):*Here, we apply `sum` to `v1` (for both `group` a & b), and `mean` to `v2` (for both `group` a & b)*```{r}dcast(DTL, id1 + id2 ~ group, fun =list(sum, mean), value.var =list("v1", 
"v2"))```:::### One-hot encodingMaking each level of a variable into a presence/absence column:```{r}#| echo: false#| output: falsemovies_long <-data.frame(ID =c(1L, 2L, 2L, 3L, 3L, 3L), Genre =c("action", "action", "adventure", "action", "adventure", "animation"),OtherCol =runif(6))MOVIES_LONG <-as.data.table(movies_long)``````{r}movies_long```::: {.panel-tabset group="framework"}##### Tidyverse```{r}pivot_wider( movies_long, names_from ="Genre", values_from ="Genre", values_fn = \(x) !is.na(x), values_fill =FALSE)```##### data.table```{r}dcast(MOVIES_LONG, ... ~ Genre, value.var ="Genre", fun = \(x) !is.na(x), fill =FALSE)```:::<!--------------------------------------------------------><!--------------------------------------------------------># Joins***:::{.callout-tip}A JOIN is just a special type of SUBSET: we subset the rows of one table based on the matching rows of a second one. And the matching conditions define what type of join we are applying.:::<!-------------------------------------------------------->## Mutating JoinsThe purpose of **mutating joins** is to add columns/information from one table to another, by matching their rows.**Data:**```{r}#| echo: falsecities <- (data.frame(city =c("Rouen", "Helsinki", "Punaauia", "Barcelona", "Sosua", "Trondheim", "Bergen", "Linz", "Queenstown", "Bern"),country_id =c(5, 4, 6, 9, 3, 8, 8, 1, 7, 10) )|>arrange(city)|>mutate(city_id =row_number(), .before =1))countries <- (data.frame(country =c("France", "Finland", "French Polynesia", "Spain", "Dominican Republic", "Canada", "Norway", "Austria", "New-Zealand") )|>arrange(country)|>mutate(country_id =row_number(), .before =1))``````{r}(CITIES <-as.data.table(cities))(COUNTRIES <-as.data.table(countries))```### Left/Right JoinBoth left & right joins append the columns of one table to those of another, in the order they are given (i.e. columns of the first table will appear first in the result). 
However, how rows are matched (and how the ones not finding a match are handled) depends on the type of join: - **Left joins** match on the rows of the first (left) table. Unmatched rows from the left table will be kept, but not the right's. - **Right joins** match on the rows of the second (right) table. Unmatched rows from the right table will be kept, but not the left's. ::: {.callout-tip appearance="simple"}#### ExampleTo find out which country each city belongs to, we're going to merge **countries into cities**.Here, we want to add data to the `cities` table by matching each city to a country (by their `country_id`). The ideal output would have the columns of `cities` first, and keep all rows from `cities`, even if unmatched: thus we will use a **left join**.:::* As a **left join**:::: {.panel-tabset group="framework"}#### Tidyverse```{r}left_join(cities, countries, by ="country_id", multiple ="all")```#### data.table:::: {.callout-warning appearance="simple"}## `data.table` natively only supports right joinsIt filters the rows of the first table by those of the second (`FIRST[SECOND]`), *but* it only keeps the unmatched rows from the second table, dropping those of the first. ```{r}#| code-fold: true#| code-summary: The normal output of the joinCITIES[COUNTRIES, on = .(country_id)]```The unmatched rows from `countries` were kept, but not the ones from `cities`. Here are two possible workarounds: ::::Inverting the two tables (`countries` first), and then inverting the order of the columns in the result:```{r}COUNTRIES[CITIES, .(city_id, city, country_id, country), on = .(country_id)]```Adding the columns of `countries` (in-place) to `cities` during the join:```{r}copy(CITIES)[COUNTRIES, c("country_id", "country") :=list(i.country_id, i.country), on = .(country_id)][]```:::We could accomplish a similar result with a right join by inverting the order of the tables. 
But the order of the columns in the result will be less ideal (countries first):* As a **right join:**::: {.panel-tabset group="framework"}#### Tidyverse```{r}right_join(countries, cities, by ="country_id", multiple ="all")```#### data.table```{r}COUNTRIES[CITIES, on = .(country_id)][order(country_id)]```:::### Full JoinFully merges the two tables, keeping the unmatched rows from both tables.::: {.panel-tabset group="framework"}#### Tidyverse```{r}full_join(cities, countries, by =join_by(country_id))```#### data.table```{r}merge(CITIES, COUNTRIES, by ="country_id", all =TRUE)[order(city_id), .(city_id, city, country_id, country)]```:::### Cross JoinGenerating all combinations of the IDs of both tables.::: {.panel-tabset group="framework"}#### Tidyverse```{r}cross_join(select(cities, city), select(countries, country))```#### data.table```{r}CJ(city = CITIES[, city], country = COUNTRIES[, country])```:::### Inner JoinMerges the columns of both tables and only returns the rows that matched between *both* tables (no unmatched rows are kept).::: {.panel-tabset group="framework"}#### Tidyverse```{r}inner_join(countries, cities, by ="country_id", multiple ="all")```#### data.table```{r}COUNTRIES[CITIES, on = .(country_id), nomatch =NULL]```:::### Self joinMerging the table with itself. Typically used on graph-type data represented as a flat table (e.g. 
hierarchies).**Data:**```{r}#| echo: falseHIERA <-fread("id first_name last_name manager_id 1 Maisy Bloom NA 2 Caine Farrow 1 3 Waqar Jarvis 2 4 Lacey-Mai Rahman 2 5 Merryn French 3")(hiera <-as.data.frame(HIERA))```The goal here is to find the identity of everyone's manager (n+1) by merging the table with itself:::: {.panel-tabset group="framework"}#### Tidyverse```{r}left_join(hiera, hiera, by =join_by(manager_id == id))```#### data.table```{r}HIERA[HIERA, on = .(manager_id = id), nomatch =NULL]```:::<!-------------------------------------------------------->## Filtering Joins```{r}#| echo: false#| output: falseCITIES <-as.data.table(cities)COUNTRIES <-as.data.table(countries)```Used to filter one table (left) based on another (right): it will only keep the columns from the left table and will either keep (**semi join**) or discard (**anti join**) the rows where IDs match between both tables.### Semi join::: {.callout-note}Will give the same result as an inner join, but will only keep the columns of the first table (no information is added).:::Here, it will filter `countries` to only keep the countries having a matching `country_id` in the cities table.::: {.panel-tabset group="framework"}#### Tidyverse```{r}semi_join(countries, cities, by =join_by(country_id))```#### data.table```{r}COUNTRIES[country_id %in% CITIES[, unique(country_id)]]``````{r}#| eval: false#| code-fold: truefsetdiff(COUNTRIES, COUNTRIES[!CITIES, on ="country_id"])COUNTRIES[!eval(COUNTRIES[!CITIES, on = .(country_id)])]```:::### Anti joinHere, it will filter `countries` to only keep the countries having no matching `country_id` in the cities table.::: {.panel-tabset group="framework"}#### Tidyverse```{r}anti_join(countries, cities, by =join_by(country_id))```#### data.table```{r}COUNTRIES[!CITIES, on = .(country_id)]``````{r}#| eval: false#| code-fold: trueCOUNTRIES[fsetdiff(COUNTRIES[, .(country_id)], CITIES[, .(country_id)])]```:::<!-------------------------------------------------------->## Non-equi 
joins**Non-equi joins** are joins where the conditions to match rows are no longer strict equalities between the tables' ID columns. We can divide non-equi joins into: - **Inequality joins**: a general inequality condition between IDs, which can result in multiple matches. - **Rolling joins**: only keep the match that minimizes the distance between the IDs (i.e. the closest to perfect equality). - **Overlap joins**: matching to all values within a range. ::: {.callout-tip}Please refer to [this page](https://r4ds.hadley.nz/joins.html#non-equi-joins) of the second edition of R4DS for more detailed explanations.:::**Data:**Events:```{r}#| echo: falseevents <- (data.frame(event =c("Alice & Mark's wedding", "Alice's graduation", "John's birthday"),e.start =c("2023-06-07 13:00:00", "2023-06-05 10:00:00", "2023-06-05 12:00:00"),e.end =c("2023-06-07 18:00:00", "2023-06-05 13:00:00", "2023-06-05 22:00:00") )|>mutate(across(matches("start|end"), lubridate::ymd_hms))|>arrange(e.start)|>mutate(e.id =row_number(), .before =1))(EVENTS <-as.data.table(events))```Strikes:```{r}#| echo: falsestrikes <- (data.frame(strike_motive =c("Life's too expensive", "Not enough cheese", "Our team lost some sport event", "Not enough wine"),s.start =c("2023-06-08 9:00:00", "2023-06-05 11:00:00", "2023-07-05 16:00:00", "2023-06-05 14:00:00"),s.end =c("2023-06-08 20:00:00", "2023-06-05 20:00:00", "2023-07-05 22:00:00", "2023-06-05 16:00:00") )|>mutate(across(matches("start|end"), lubridate::ymd_hms))|>arrange(s.start)|>mutate(s.id =row_number(), .before =1))(STRIKES <-as.data.table(strikes))```### Inequality joinInequality joins are joins (left, right, inner, ...) that use inequalities (`<`, `<=`, `>=`, or `>`) to specify the matching criteria.::: {.callout-warning}The condition has to be a simple inequality between existing columns: it cannot be an arbitrary function (e.g. 
`date.x <= min(date.y) * 2` will not work).:::* For each event, which strikes occurred (finished) before the event?::: {.panel-tabset group="framework"}#### Tidyverse```{r}inner_join(events, strikes, join_by(e.start >= s.end))```#### data.table```{r}EVENTS[STRIKES, on = .(e.start >= s.end), nomatch =NULL]```::: {.callout-caution}When specifying an equality or inequality condition, `data.table` will merge the two columns: only one will remain, with the values of the second column and the name of the first. Here, `e.start` will have the values of `s.end` (which will be removed). This is known behaviour of data.table's non-equi joins rather than a bug; the left table's original values can be recovered in `j` with the `x.` prefix (e.g. `x.e.start`).::::::::: {.callout-tip appearance="simple"}## A useful use-case for inequality joins is to avoid duplicates when generating combinations of items in cross joins:**Data:**```{r}#| echo: falsepeople <-data.frame(name =c("Alice", "Mark", "John")) |>mutate(id =row_number(), .before =1)people```:::: {.columns}::::: {.column width="49%"}**All permutations:** with duplicates (order matters)```{r}cross_join(people, people)```:::::::::: {.column width="2%"}:::::::::: {.column width="49%"}**All combinations:** without duplicates (order doesn't matter)```{r}inner_join(people, people, join_by(id < id))```::::::::::::### Rolling joinsRolling joins are a special type of inequality join where instead of getting every row that satisfies the inequality, we get the one where the IDs are the closest to equality.::: {.panel-tabset group="framework"}#### Tidyverse* Which strike started the soonest *after* the beginning of an event?```{r}inner_join(events, strikes, join_by(closest(e.start <= s.start)))```**** Which strike ended the soonest *before* the start of an event?```{r}inner_join(events, strikes, join_by(closest(e.start >= s.end)))```#### data.table* Which strike started the soonest *after* the beginning of an event?```{r}EVENTS[STRIKES, on = .(e.start == s.start), roll ="nearest" ][, .SD[which.min(abs(e.start - e.end))], by ="e.id"]```::: {.callout-note}Using the `roll` 
argument relaxes the equality constraint of the join (`e.start == s.start`).:::**** Which strike ended the soonest *before* the start of an event?```{r}EVENTS[STRIKES, on = .(e.start == s.end), roll =-Inf ][, .SD[which.min(abs(e.start - e.end))], by ="e.id"]``````{r}#| echo: false#| eval: falseEVENTS[STRIKES, on = .(e.start >= s.end), nomatch =NULL][, .SD[which.min(abs(e.start - e.end))], by ="event"]```:::### Overlap joins::: {.panel-tabset group="framework"}#### Tidyverse:::: {.callout-note appearance="simple" collapse="true"}##### `dplyr` helper functions`dplyr` provides three helper functions to make it easier to work with intervals: - `between(x, y_min, y_max)` <=> `x >= y_min, x <= y_max`: a value of the first table is within a given range of the second - `within(x_min, x_max, y_min, y_max)` <=> `x_min >= y_min, x_max <= y_max`: the ranges of the first table are contained within the second's - `overlaps(x_min, x_max, y_min, y_max)` <=> `x_min <= y_max, x_max >= y_min`: the two ranges overlap partially or totally, in any direction::::* **Between:** Which events had a strike starting in the two hours before the beginning of the event?:::: {.callout-tip}First, we need to create the new "2 hours before the beginning of the event" column since we cannot use arbitrary functions in `join_by()` (e.g. we cannot do `between(s.start, e.start - hours(2), e.start)`)::::```{r}events2 <-mutate(events, e.start_minus2 = e.start -hours(2))``````{r}inner_join(strikes, events2, join_by(between(s.start, e.start_minus2, e.start))) |>select(colnames(events), colnames(strikes)) # Re-ordering the columns```:::: {.callout-note}By default, the value to match needs to be from the first table, and the range it falls within needs to be from the second table. 
Depending on the column order we need, this can force us to reorder the columns post-join (as in the above example). This can be alleviated by manually specifying which table each column comes from, using `x$col` and `y$col` (`x` referring to the first table).::::```{r}inner_join(events2, strikes, join_by(between(y$s.start, x$e.start_minus2, x$e.start))) |>select(-e.start_minus2)```Manually:```{r}inner_join(events2, strikes, join_by(e.start_minus2 <= s.start, e.start >= s.start)) |>select(-e.start_minus2)```**** **Within:** Which strikes occurred entirely within the period of an event?```{r}inner_join(strikes, events, join_by(within(s.start, s.end, e.start, e.end)), multiple ="all") |>select(colnames(events), colnames(strikes)) # Re-ordering the columns```:::: {.callout-note}As before, `within()` requires the first range to be within the second by default, meaning the first table must be the one with the smaller range. Using `x$col` and `y$col` resolves the issue of column order.::::```{r}inner_join(events, strikes, join_by(within(y$s.start, y$s.end, x$e.start, x$e.end)), multiple ="all")```Manually:```{r}inner_join(events, strikes, join_by(e.start <= s.start, e.end >= s.end), multiple ="all")```**** **Overlaps:** Which events overlap with each other?```{r}inner_join(events, events, join_by(e.id < e.id, overlaps(e.start, e.end, e.start, e.end)))```Manually:```{r}inner_join(events, events, join_by(e.id < e.id, e.start <= e.end, e.end >= e.start))```#### data.table<br>* **Between:** Which events had a strike starting in the two hours before the beginning of the event?```{r}copy(EVENTS)[, e.start_minus2 := e.start -hours(2) ][STRIKES, on = .(e.start_minus2 <= s.start, e.start >= s.start), nomatch =NULL ][, -"e.start_minus2"]```<br><br><br><br><br><br><br><br><br>**** **Within:** Which strikes occurred entirely within the period of an event?```{r}EVENTS[STRIKES, on = .(e.start <= s.start, e.end >= s.end), nomatch =NULL]```<br><br><br><br><br><br><br><br><br>**** 
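The within case also maps onto data.table's dedicated `foverlaps()` function (used below for the overlaps case): with `type = "within"`, each interval of the first table is matched to the intervals of the second table that fully contain it. A minimal, self-contained sketch with toy stand-ins (`EV`, `ST`) for the `EVENTS` / `STRIKES` tables defined above — the miniature data here is illustrative only:

```{r}
library(data.table)

# Toy stand-ins for the EVENTS / STRIKES tables (illustrative only)
EV <- data.table(e.id    = 1:2,
                 e.start = as.POSIXct(c("2023-06-05 10:00", "2023-06-07 13:00")),
                 e.end   = as.POSIXct(c("2023-06-05 22:00", "2023-06-07 18:00")))
ST <- data.table(s.id    = 1:2,
                 s.start = as.POSIXct(c("2023-06-05 11:00", "2023-06-08 09:00")),
                 s.end   = as.POSIXct(c("2023-06-05 20:00", "2023-06-08 20:00")))

# foverlaps() requires the right-hand table to be keyed on its interval columns
setkey(EV, e.start, e.end)

# type = "within": keep the strikes whose [s.start, s.end] interval
# lies entirely inside an event's [e.start, e.end] interval
foverlaps(ST, EV, by.x = c("s.start", "s.end"), type = "within", nomatch = NULL)
```

Here only the first strike (contained in the first event) survives; the second strike matches no event and is dropped by `nomatch = NULL`.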
**Overlaps:** Which events overlap with each other?```{r}EVENTS[EVENTS, on = .(e.id < e.id, e.start <= e.end, e.end >= e.start), nomatch =NULL]``````{r}setkey(EVENTS, e.start, e.end)foverlaps(EVENTS, EVENTS, type ="any", mult ="first", nomatch =NULL)[e.id != i.e.id]```:::<!--------------------------------------------------------><!--------------------------------------------------------># Tidyr & Others***```{r}#| echo: falseMT <-as.data.table(mtcars)IRIS <-as.data.table(iris)[, Species :=as.character(Species)]```<!-------------------------------------------------------->## Remove NA::: {.panel-tabset group="framework"}#### Tidyverse```{r}tidyr::drop_na(IRIS, Species)tidyr::drop_na(IRIS, matches("Sepal"))```#### data.table```{r}na.omit(IRIS, cols ="Species")na.omit(IRIS, cols =str_subset(colnames(IRIS), "Sepal"))```:::<!-------------------------------------------------------->## UniteCombine multiple columns into a single one:::: {.panel-tabset group="framework"}#### Tidyverse```{r}mtcars |> tidyr::unite("x", gear, carb, sep ="_")```#### data.table```{r}copy(MT)[, x :=paste(gear, carb, sep ="_")][]```:::<!-------------------------------------------------------->## Separate / Extract### Separate wider (extract)```{r}(MT.ext <- MT[, .(x =str_c(gear, carb, sep ="_"))])```::: {.panel-tabset group="framework"}#### TidyverseBased on a delimiter:```{r}MT.ext |>separate_wider_delim(x, delim ="_", names =c("gear", "carb"))```Based on a regex:```{r}MT.ext |>separate_wider_regex(x, patterns =c(gear ="\\d{1}", "_", carb ="\\d{1}"))```Based on position:```{r}MT.ext |>separate_wider_position(x, widths =c(gear =1, delim =1, carb =1))```:::: {.callout-note}`separate_wider_*` supersedes both `extract` and `separate`.::::```{r}#| eval: false#| code-fold: true#| code-summary: Old syntaxtidyr::separate(MT.ext, x, into =c("gear", "carb"), sep ="_", remove =TRUE)tidyr::extract(MT.ext, x, into =c("gear", "carb"), regex ="(.*)_(.*)", remove =TRUE)```#### data.tableBased on a 
delimiter:```{r}copy(MT.ext)[, c("gear", "carb") :=tstrsplit(x, "_", fixed =TRUE)][] ```Based on a regex:```{r}copy(MT.ext)[, c("gear", "carb") :=str_extract_all(x, "\\d") |>list_transpose()][]```:::### Separate longer/rowsSeparating a row into multiple rows, duplicating the rest of the values.**Data**```{r}(SP <-data.table(val =c(1,"2,3",4), date =as.Date(c("2020-01-01", "2020-01-02", "2020-01-03"), origin ="1970-01-01") ))```::: {.panel-tabset group="framework"}#### TidyverseBased on a delimiter:```{r}SP |>separate_longer_delim(val, delim =",")```Based on position:```{r}SP |>separate_longer_position(val, width =1) |>filter(val !=",")```:::: {.callout-warning}`separate_longer_*` now supersedes `separate_rows`::::```{r}#| eval: false#| code-fold: true#| code-summary: Old syntaxSP |>separate_rows(val, sep =",", convert =TRUE)```#### data.table**Solution 1:**```{r}copy(SP)[, c(V1 =strsplit(val, ",", fixed =TRUE), .SD), by = val][, let(val = V1, V1 =NULL)][]```**Solution 2:**```{r}SP[, strsplit(val, ",", fixed =TRUE), by = val][SP, on ="val"][, let(val = V1, V1 =NULL)][]```**Solution 3:**_(With type conversion)_```{r}SP[, unlist(tstrsplit(val, ",", type.convert =TRUE)), by = val][SP, on ="val"][, let(val = V1, V1 =NULL)][]```**Solution 4:**```{r}copy(SP)[rep(1:.N, lengths(strsplit(val, ",")))][, val :=strsplit(val, ","), by = val][]``````{r}#| eval: false#| echo: false# copy(SP)[rep(1:.N, lengths(strsplit(val, ",")))][, val := unlist(strsplit(SP$val, ","))][]```_(With type conversion)_```{r}#| eval: false#| echo: false# copy(SP)[rep(1:.N, lengths(strsplit(val, ",")))# ][, val := utils::type.convert(unlist(strsplit(SP$val, ",")), as.is = T, na.strings = "")][]``````{r}copy(SP)[rep(1:.N, lengths(strsplit(val, ","))) ][, val :=strsplit(val, ","), by = val ][, val := utils::type.convert(val, as.is = T)][]```:::<!-------------------------------------------------------->## Duplicates### Duplicated rows#### Only keeping duplicated rows::: {.panel-tabset 
group="framework"}#### Tidyverse```{r}mtcars |>filter(n() >1, .by =c(mpg, hp))```#### data.table```{r}MT[, if(.N >1) .SD, by = .(mpg, hp)]```:::#### Removing duplicated rows:::{.callout-note}This is different from distinct/unique, which will keep one of the duplicated rows of each group. This removes all groups that have duplicated rows.:::::: {.panel-tabset group="framework"}#### Tidyverse```{r}mtcars |>filter(n() ==1, .by =c(mpg, hp))``````{r}#| eval: false#| code-fold: true# More convolutedmtcars |>filter(n() >1, .by =c(mpg, hp)) |>anti_join(mtcars, y = _)```#### data.table```{r}MT[, if(.N ==1) .SD, by = .(mpg, hp)]``````{r}#| eval: false#| code-fold: true# More convolutedMT[!MT[, if(.N >1) .SD, by = .(mpg, hp)], on =colnames(MT)]fsetdiff(MT, setcolorder(MT[, if(.N >1) .SD, by = .(mpg, hp)], colnames(MT)))```:::### Duplicated values (per row)```{r}(DUPED <-data.table(A =c("A1", "A2", "B3", "A4"), B =c("B1", "B2", "B3", "B4"), C =c("A1", "C2", "D3", "C4"), D =c("A1", "D2", "D3", "D4") ))```::: {.panel-tabset group="framework"}#### Tidyverse```{r}mutate(DUPED, Repeats =apply(pick(everything()), 1, \(r) r[which(duplicated(r))] |>unique() |>str_c(collapse =", ") ))```#### data.table```{r}copy(DUPED)[ , Repeats :=apply(.SD, 1, \(r) r[which(duplicated(r))] |>unique() |>str_c(collapse =", ")) ][]```:::**With duplication counter:**```{r}dup_counts <-function(v) { rles <-as.data.table(unclass(rle(v[which(duplicated(v))])))[, lengths := lengths +1]paste(apply(rles, 1, \(r) paste0(r[2], " (", r[1], ")")), collapse =", ")}```::: {.panel-tabset group="framework"}#### Tidyverse```{r}DUPED |>mutate(Repeats =apply(pick(everything()), 1, \(r) dup_counts(r)))```#### data.table```{r}DUPED[, Repeats :=apply(.SD, 1, \(r) dup_counts(r))][]```:::<!-------------------------------------------------------->## Expand & CompleteHere, we are missing an entry for person B for year 2010, which we want to fill:```{r}(CAR <-data.table(year 
=c(2010,2011,2012,2013,2014,2015,2011,2012,2013,2014,2015), person =c("A","A","A","A","A","A", "B","B","B","B","B"),car =c("BMW", "BMW", "AUDI", "AUDI", "AUDI", "Mercedes", "Citroen","Citroen", "Citroen", "Toyota", "Toyota") ))```### Expand::: {.panel-tabset group="framework"}#### Tidyverse```{r}tidyr::expand(CAR, person, year)```#### data.table```{r}CJ(CAR$person, CAR$year, unique =TRUE)```:::### CompleteJoins the original dataset with the expanded one:::: {.panel-tabset group="framework"}#### Tidyverse```{r}CAR |> tidyr::complete(person, year)```#### data.table```{r}CAR[CJ(person, year, unique =TRUE), on = .(person, year)]```:::<!-------------------------------------------------------->## UncountDuplicating aggregated rows to get back the un-aggregated version.**Data**```{r}#| echo: false#| output: falsedat_agg <- readr::read_table("Site Domain Mild Moderate Severe23 A1 4 0 027 A1 0 1 128 A1 0 1 029 A1 0 0 131 A1 0 1 033 A1 0 1 141 A1 3 0 148 A1 0 2 464 A1 1 0 066 A1 1 0 0") |>mutate(ID =row_number(), .before =1)DAT_AGG <-as.data.table(dat_agg)``````{r}cols <-c("Mild", "Moderate", "Severe")dat_agg```::: {.panel-tabset group="framework"}#### Tidyverse```{r}dat_agg |>pivot_longer(cols =all_of(cols), names_to ="Severity", values_to ="Count") |>uncount(Count) |>mutate(ID_new =row_number(), .after ="ID") |>pivot_wider(names_from ="Severity", values_from ="Severity", values_fn = \(x) ifelse(is.na(x), 0, 1), values_fill =0 )```#### data.table**Solution 1:**```{r}(melt(DAT_AGG, measure.vars = cols, variable.name ="Severity", value.name ="Count") [rep(1:.N, Count)][, ID_new := .I] |>dcast(... 
~ Severity, value.var ="Severity", fun.agg = \(x) ifelse(is.na(x), 0, 1), fill =0)|> _[, -"Count"])```**Solution 2:**```{r}DAT_AGG[Reduce(`c`, sapply(mget(cols), \(x) rep(1:.N, x))) ][, (cols) :=lapply(.SD, \(x) ifelse(x >1, 1, x)), .SDcols = cols ][order(ID)]```:::<!-------------------------------------------------------->## List / UnlistWhen a column contains a simple vector/list of values (of the same type, without structure)### One listed column**Single ID (grouping) column:**Data:```{r}#| echo: falseMT_LIST <- MT[, .(mpg = .(mpg)), keyby = cyl]mt_list <- mtcars |>summarize(mpg =list(mpg), .by = cyl)``````{r}MT_LIST```::: {.panel-tabset group="framework"}#### Tidyverse```{r}mt_list |>unnest(cols = mpg)```#### data.table```{r}MT_LIST[, .(mpg =unlist(mpg)), keyby = cyl]```Alternative that bypasses the need of grouping when unlisting by growing the `data.table` back to its original number of rows before unlisting:```{r}MT_LIST[rep(MT_LIST[, .I], lengths(mpg))][, mpg :=unlist(MT_LIST$mpg)][]```:::**Multiple ID (grouping) columns:**Data:```{r}#| echo: falsemt_list2 <- mtcars |>summarize(mpg =list(mpg), .by =c(cyl, gear))MT_LIST2 <- MT[, .(mpg = .(mpg)), keyby = .(cyl, gear)]``````{r}mt_list2```::: {.panel-tabset group="framework"}#### Tidyverse```{r}mt_list2 |>unnest(cols = mpg) # group_by(cyl, gear) is optional```#### data.table_Solution 1:_```{r}MT_LIST2[, .(mpg =unlist(mpg)), by =setdiff(colnames(MT_LIST2), 'mpg')]```_Solution 2:_```{r}MT_LIST2[rep(MT_LIST2[, .I], lengths(mpg))][, mpg :=unlist(MT_LIST2$mpg)][]```:::### Multiple listed columnData:```{r}#| echo: falsemt_list_mult <- mtcars |>summarize(across(c(mpg, disp), \(c) list(c)), .by =c(cyl, gear))MT_LIST_MULT <- MT[, lapply(.SD, \(c) .(c)), keyby = .(cyl, gear), .SDcols =c("mpg", "disp")]``````{r}mt_list_mult```::: {.panel-tabset group="framework"}#### Tidyverse```{r}mt_list_mult |>unnest(cols =c(mpg, disp)) # group_by(cyl, gear) is optional```#### data.table```{r}MT_LIST_MULT[, lapply(.SD, \(c) unlist(c)), 
by =setdiff(colnames(MT_LIST_MULT), c("mpg", "disp"))]```:::<!-------------------------------------------------------->## Nest / UnnestWhen a column contains a data.table/data.frame (with multiple columns, structured)### One nested column**Nesting**::: {.panel-tabset group="framework"}#### Tidyverse```{r}mtcars |> tidyr::nest(data =-cyl) # Data is inside tibbles``````{r}#| eval: false#| code-fold: truemtcars |>nest_by(cyl) |>ungroup() # Data is inside vctrs_list_of. Returns a rowwise() df```Nesting while keeping the grouping variable inside the nested tables:```{r}mtcars |> tidyr::nest(data =everything(), .by = cyl)```#### data.table```{r}MT[, .(data = .(.SD)), keyby = cyl]```Nesting while keeping the grouping variable inside the nested tables:```{r}MT[, .(data =list(data.table(cyl, .SD))), keyby = cyl]```:::**Unnesting**Data:```{r}mtcars_nest <- mtcars |> tidyr::nest(data =-cyl)MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl]```::: {.panel-tabset group="framework"}#### Tidyverse```{r}mtcars_nest |>unnest(cols = data) |>ungroup()```#### data.table```{r}MT_NEST[, rbindlist(data), keyby = cyl] # MT_NEST[, do.call(c, data), keyby = cyl]```:::### Multiple nested column**Nesting:**::: {.panel-tabset group="framework"}#### Tidyverse```{r}(mtcars |>nest(data1 =c(mpg, hp), data2 =!c(cyl, gear, mpg, hp), .by =c(cyl, gear)) -> mt_nest_mult)```#### data.table```{r}(MT[, .(data1 = .(.SD[, .(mpg, hp)]), data2 = .(.SD[, !c("mpg", "hp")])), by = .(cyl, gear)] -> MT_NEST_MULT)```:::**Unnesting:**::: {.panel-tabset group="framework"}#### Tidyverse```{r}mt_nest_mult |>unnest(cols =c(data1, data2))```Using a pattern to specify the columns to unnest:```{r}mt_nest_mult |>unnest(cols =matches("data"))```#### data.table```{r}MT_NEST_MULT[, c(rbindlist(data1), rbindlist(data2)), keyby = .(cyl, gear)]```Using a pattern to specify the columns to unnest:```{r}MT_NEST_MULT[, do.call(c, unname(lapply(.SD, \(c) rbindlist(c)))), .SDcols =patterns('data'), keyby = .(cyl, gear)]``````{r}#| echo: 
false#| code-fold: trueMT_NEST_MULT[, list_cbind(unname(lapply(.SD, \(c) rbindlist(c)))), .SDcols =patterns('data'), keyby = .(cyl, gear)]```:::### Operate on nested/list columns**Data:**```{r}#| echo: falsemt_nest <- mtcars |>nest(data =-cyl)MT_NEST <- MT[, .(data = .(.SD)), by = cyl]``````{r}mt_nest```**Creating a new column using the nested data:**::: {.panel-tabset group="framework"}#### TidyverseKeeping the nested column:```{r}mt_nest |>mutate(sum =sum(unlist(data)), .by = cyl)```Dropping the nested column:```{r}mt_nest |>summarize(sum =sum(unlist(data)), .by = cyl)```#### data.tableKeeping the nested column:```{r}copy(MT_NEST)[, sum :=sapply(data, \(r) sum(r)), by = cyl][]```Dropping the nested column:```{r}MT_NEST[, .(sum =sapply(data, \(r) sum(r))), by = cyl]```:::**Creating multiple new columns using the nested data:**```{r}linreg <- \(data) lm(mpg ~ hp, data = data) |> broom::tidy()```::: {.panel-tabset group="framework"}#### Tidyverse```{r}mt_nest |>group_by(cyl) |>group_modify(\(d, g) linreg(unnest(d, everything()))) |>ungroup()```#### data.table```{r}MT_NEST[, rbindlist(lapply(data, \(ndt) linreg(ndt))), keyby = cyl][]```:::**Operating inside the nested data:**::: {.panel-tabset group="framework"}#### Tidyverse```{r}mt_nest |>mutate(data =map(data, \(t) mutate(t, sum =pmap_dbl(pick(everything()), sum)))) |>unnest(data)``````{r}#| eval: false#| code-fold: truemt_nest |>mutate(across(data, \(ts) map(ts, \(t) mutate(t, sum =apply(pick(everything()), 1, sum))))) |>unnest(data)``````{r}#| eval: false#| code-fold: true#| code-summary: Using the `nplyr` packagelibrary(nplyr)mt_nest |> nplyr::nest_mutate(data, sum =apply(pick(everything()), 1, sum)) |>unnest(data)```#### data.table```{r}copy(MT_NEST)[, data :=lapply(data, \(dt) dt[, sum :=apply(.SD, 1, sum)]) ][, rbindlist(data), keyby = cyl]```:::<!-------------------------------------------------------->## Rotate / Transpose```{r}(MT_SUMMARY <- MT[, tidy(summary(mpg)), by = cyl])```**Using pivots:**::: 
{.panel-tabset group="framework"}#### Tidyverse```{r}MT_SUMMARY |>pivot_longer(!cyl, names_to ="Statistic") |>pivot_wider(id_cols ="Statistic", names_from ="cyl", names_prefix ="Cyl ")```#### data.table```{r}MT_SUMMARY |>melt(id.vars ="cyl", variable.name ="Statistic") |>dcast(Statistic ~paste0("Cyl ", cyl))```:::**With dedicated functions:**::: {.panel-tabset group="framework"}#### Tidyverse```{r}# No function exists to do this AFAIK```#### data.table```{r}data.table::transpose(MT_SUMMARY, keep.names ="Statistic", make.names =1)```:::***
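One base-R escape hatch for the missing Tidyverse transpose verb: `t()` transposes a data frame (coercing all values to a common type, so it is safest on all-numeric columns), and `tibble::rownames_to_column()` recovers the statistic names. A minimal sketch on a toy summary table (the column names here are illustrative, not the actual `MT_SUMMARY` output):

```{r}
library(tibble)

# Toy stand-in for MT_SUMMARY: one row of summary statistics per cyl group
mt_summary <- data.frame(cyl     = c(4, 6, 8),
                         minimum = c(21.4, 17.8, 10.4),
                         maximum = c(33.9, 21.4, 19.2))

# t() turns the statistic columns into rows; former rows become "Cyl <n>" columns
transposed <- as.data.frame(t(mt_summary[, -1]))
names(transposed) <- paste0("Cyl ", mt_summary$cyl)
rownames_to_column(transposed, "Statistic")
```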